Abstract

Today's hardware technology presents a new challenge in designing robust systems. Deep
submicron VLSI technology introduced transient and permanent faults that were never considered
in low-level system designs in the past. Still, robustness of that part of the system is crucial
and needs to be guaranteed for any successful product. Distributed systems, on the other hand,
have been dealing with similar issues for decades. However, neither the basic abstractions nor
the complexity of contemporary fault-tolerant distributed algorithms match the peculiarities of
hardware implementations.

This paper is part of an effort to bridge this gap between theory and practice for the clock
synchronization problem. Solving this task sufficiently well will make it possible to
build a very robust high-precision clocking system for hardware designs like systems-on-chips in
critical applications. As our first building block, we describe and prove correct a novel Byzantine
fault-tolerant self-stabilizing pulse synchronization protocol, which can be implemented using
standard asynchronous digital logic. Despite the strict limitations introduced by hardware
designs, it offers optimal resilience and smaller complexity than all existing protocols.



1 Introduction & Related Work 



With today's deep submicron technology running at GHz clock speeds [20], disseminating the
high-speed clock throughout a very large scale integrated (VLSI) circuit, with negligible skew, is
difficult and costly [2, 3, 12, 21, 29]. Systems-on-chip are hence increasingly designed globally
asynchronous locally synchronous (GALS) [3], where different parts of the chip use different local
clock signals. Two main types of clocking schemes for GALS systems exist, namely, (i) those where
the local clock signals are unrelated, and (ii) multi-synchronous ones that provide a certain degree
of synchrony between local clock signals [30, 31].

GALS systems clocked by type (i) permanently bear the risk of metastable upsets when con- 
veying information from one clock domain to another. To explain the issue, consider a physical 
implementation of a bistable storage element, like a register cell, which can be accessed by read 
and write operations concurrently. It can be shown that two operations (like two writes with dif- 
ferent values) occurring very closely to each other can cause the storage cell to attain neither of its 
two stable states for an unbounded time [23], and thereby, during an unbounded time afterwards,
successive reads may return none of the stable states. Although the probability of a single upset
is very small, one has to take into account that every bit of information transmitted across clock
domains is a candidate for an upset. Elaborate synchronizers [5, 25] are the only means for
achieving an acceptably low probability of metastable upsets here.

This problem can be circumvented in clocking schemes of type (ii). Common synchrony properties
offered by multi-synchronous clocking systems are:

• bounded skew, i.e., bounded maximum time between the occurrence of any two matching clock
transitions of any two local clock signals. Thereby, in classic clock synchronization, two clock
transitions are matching iff they are both the $k$-th, $k \geq 1$, clock transition of a local clock.

• bounded accuracy, i.e., bounded minimum and maximum time between the occurrence of any
two successive clock transitions of any local clock signal.
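
As an illustration of the two properties, the following minimal sketch computes the observed skew and accuracy bounds from recorded transition times. The function names and the sample data are hypothetical and not part of the paper; the sketch only measures a finite trace and proves nothing about a clocking scheme.

```python
from typing import Dict, List, Tuple


def max_skew(ticks: Dict[str, List[float]]) -> float:
    """Largest time between the k-th transitions of any two clocks, over all k."""
    # Only compare transitions that every clock has already produced.
    common = min(len(times) for times in ticks.values())
    return max(
        max(times[k] for times in ticks.values())
        - min(times[k] for times in ticks.values())
        for k in range(common)
    )


def accuracy_bounds(ticks: Dict[str, List[float]]) -> Tuple[float, float]:
    """Minimum and maximum time between successive transitions of any clock."""
    gaps = [b - a for times in ticks.values() for a, b in zip(times, times[1:])]
    return min(gaps), max(gaps)


# Hypothetical transition times (arbitrary units) of three local clocks.
ticks = {
    "A": [0.0, 1.0, 2.1],
    "B": [0.1, 1.2, 2.0],
    "C": [0.05, 1.1, 2.2],
}
skew = max_skew(ticks)
shortest, longest = accuracy_bounds(ticks)
```

Here matching transitions are identified by their index in each list, which corresponds to the classic notion of the $k$-th transition described above.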

Type (ii) clocking schemes are particularly beneficial from a designer's point of view, since they 
combine the convenient local synchrony of a GALS system with a global time base across the whole 
chip. It has been shown in [27] that these properties indeed facilitate metastability-free high-speed
communication across clock domains. 

The decreasing structure sizes of deep submicron technology have also resulted in an increased
likelihood of chip components failing during operation: Reduced voltage swing and smaller critical
charges make circuits more susceptible to ionized particle hits, crosstalk, and electromagnetic
interference [2, 18]. Fault-tolerance hence becomes an increasingly pressing issue in chip design.
Unfortunately, faulty components may behave in non-benign ways. They may perform signal
transitions at arbitrary times and even convey inconsistent information to their successor
components if their outgoing communication channels are affected by a failure. This forces one to
model faulty components as unrestricted, i.e., Byzantine, if a high fault coverage is to be guaranteed.

The DARTS fault-tolerant clock generation approach [151 [H] developed by some of the authors
of this paper is a Byzantine fault-tolerant multi-synchronous clocking scheme. DARTS comprises a
set of modules, each of which generates a local clock signal for a single clock domain. The DARTS
modules (nodes) are synchronized to each other to within a few clock cycles. This is achieved
by exchanging binary clock signals only, via single wires. The basic idea behind DARTS is to
employ a simple fault-tolerant distributed algorithm [SS], based on Srikanth & Toueg's consistent
broadcasting primitive [3T], implemented in asynchronous digital logic. An important property of
the DARTS clocking scheme is that it guarantees that no metastable upsets occur during fault-free
executions. For executions with faults, metastable upsets cannot be ruled out: Since Byzantine
faulty components are allowed to issue unrelated read and write accesses by definition, the same
arguments as for clocking schemes of type (i) apply. However, in [13], it was shown that by proper
chip design the probability of a Byzantine component leading to a metastable upset of DARTS can
be made arbitrarily small.

Although both theoretical analysis and experimental evaluation revealed many attractive additional
features of DARTS, like guaranteed startup, automatic adaptation to current operating
conditions, etc., there is room for improvement. The most obvious drawback of DARTS is its inability
to support late joining and restarting of nodes, and, more generally, its lack of self-stabilization
properties. If, for some reason, more than a third of the DARTS nodes ever become faulty, the
system cannot be guaranteed to resume normal operation even if all failures cease. Even worse,
simple transient faults such as radiation- or crosstalk-induced additional (or omitted) clock ticks
accumulate over time to arbitrarily large skews in an otherwise benign execution.

Byzantine-tolerant self-stabilization, on the other hand, is the major strength of a number of
protocols P3 El El [T^ [22] primarily devised for distributed systems. Of particular interest in the
above context is the work on self-stabilizing pulse synchronization, where the purpose is to generate
well-separated anonymous pulses that are synchronized at all correct nodes. This facilitates
self-stabilizing clock synchronization, as agreement on a time window permits simulating a synchronous
protocol in a bounded-delay system. Beyond optimal (i.e., $\lceil n/3 \rceil - 1$, cf. [26]) resilience, an
attractive feature of these protocols is a small stabilization time [U EJ [HI |22], which is crucial
for applications with stringent availability requirements. In particular, [1] synchronizes clocks
in expected constant time in a synchronous system. Given any pulse synchronization protocol
stabilizing in a bounded-delay system in expected time T, this implies an expected $(T + O(1))$-stabilizing
clock synchronization protocol.

Nonetheless, it remains open whether a convergence time that is sublinear in the number of
nodes n can be achieved: While the classical consensus lower bound of $f + 1$ rounds
for synchronous, deterministic algorithms in a system with $f < n/3$ faults [11] proves that exact
agreement on a clock value requires at least $f + 1 \in \Omega(n)$ deterministic rounds, one has to face the
fact that only approximate agreement on the current time is achievable in a bounded-delay system
anyway. However, no non-trivial lower bounds on approximate deterministic synchronization or
the exact problem with randomization are known so far.

Note that existing synchronization algorithms, in particular those that do not rely on pulse
synchronization, have deficiencies rendering them unsuitable in our context. For example, they
have exponential convergence time [9], require the relative drift of the nodes' local clocks to be very
small^1 [3 [22], provide larger skew only [22], or make use of linear-sized messages [6]. Furthermore,
standard models used by the distributed systems community do not account for metastability,
and neither do the existing solutions.

It is hence natural to explore ways of combining and extending the above lines of research. The 
present paper is the first step towards this goal. 

^1 Note that it is too costly and space consuming to equip each node with a quartz oscillator. Simple digital
oscillators, like inverters with feedback, in turn exhibit drifts of at least several percent, which vary heavily with
operating conditions.

Detailed contributions. We describe and prove correct the novel FATAL pulse synchronization
protocol, which facilitates a direct implementation in standard asynchronous digital logic. It
self-stabilizes within $O(n)$ time with probability $1 - 2^{-(n-f)}$ in the presence of up to $\lceil n/3 \rceil - 1$
Byzantine faulty nodes, and is metastability-free by construction after stabilization in failure-free
runs.^2 While executing the protocol, non-faulty nodes broadcast a constant number of bits in
constant time. In terms of distributed message complexity, this implies that stabilization is achieved
after broadcasting $O(n)$ messages of size $O(1)$, improving by a factor of $\Omega(n)$ on the number of bits
transmitted by previous algorithms.^3 The protocol can sustain large relative clock drifts of more
than 10%, which is crucial if the local clock sources are simple ring oscillators (uncompensated ring
oscillators suffer from clock drifts of up to 9% [32]). If the number of faults is not overwhelming,
i.e., a majority of at least $n - f$ nodes continues to execute the protocol in an orderly fashion,
recovering nodes and late joiners (re)synchronize in constant time. This property is highly desirable
in practical systems, in particular in combination with Byzantine fault-tolerance: Even if nodes
randomly experience transient faults on a regular basis, quick recovery ensures that the mean time
until failure of the system as a whole is substantially increased. All this is achieved against a
powerful adversary that, at time t, knows the whole history of the system up to time $t + \varepsilon$ (where
$\varepsilon > 0$ is infinitesimally small) and does not need to choose the set of faulty nodes in advance.
Apart from bounded drifts and communication delays, our solution solely requires that receivers
can unambiguously identify the sender of a message, which is a property that arises naturally in
hardware designs.

We also describe how the pulse synchronization protocol can be implemented using asynchronous 
digital logic. Moreover, we sketch how the pulse synchronization protocol will be integrated with 
DARTS clocks to build a high-precision self-stabilizing clocking system for multi-synchronous GALS. 
The basic idea of our integration is to let the pulse synchronization protocol non-intrusively
monitor the operation of DARTS clocks and to recover DARTS clocks that run abnormally. Like the
original DARTS, the joint system is metastability-free in failure-free runs after stabilization. During 
stabilization, the fact that nodes merely undergo a constant number of state transitions in constant 
time ensures a very small probability of metastable upsets. 

2 Model 

Our formal framework will be tied to the peculiarities of hardware designs, which consist of modules
that continuously^4 compute their output signals based on their input signals. Following [14, 16],
we define (the trace of) a signal to be a timed event trace over a finite alphabet $\mathbb{S}$ of possible signal
states: Formally, signal $\sigma \subseteq \mathbb{S} \times \mathbb{R}^+_0$. All times and time intervals refer to a global reference time
taken from $\mathbb{R}^+_0$, that is, signals describe the system's behaviour from time 0 on. The elements of $\sigma$
are called events, and for each event $(s,t)$ we call s the state of event $(s,t)$ and t the time of event
$(s,t)$. In general, a signal $\sigma$ is required to fulfill the following conditions: (i) for each time interval
$[t^-,t^+] \subseteq \mathbb{R}^+_0$ of finite length, the number of events in $\sigma$ with times within $[t^-,t^+]$ is finite, (ii) from
$(s,t) \in \sigma$ and $(s',t) \in \sigma$ follows that $s = s'$, and (iii) there exists an event at time 0 in $\sigma$.

^2 Note that the algorithm from [1] achieving an expected constant stabilization time in a synchronous model needs
to run for $\Omega(n)$ rounds to ensure the same probability of stabilization.

^3 We remark that [21] achieves the same complexity, but considers a much simpler model. In particular, all
communication is restricted to broadcasts, i.e., all nodes observe the same behaviour of a given other node, even if it
is faulty.

^4 In sharp contrast to classic distributed computing models, there is no computationally complex discrete zero-time
state-transition here.
Note that our definition allows for events $(s,t)$ and $(s,t') \in \sigma$, where $t < t'$, without having an
event $(s',t'') \in \sigma$ with $s' \neq s$ and $t < t'' < t'$. In this case, we call event $(s,t')$ idempotent. Two
signals $\sigma$ and $\sigma'$ are equivalent iff they differ in idempotent events only. We identify all signals of
an equivalence class, as they describe the same physical signal. Each equivalence class $[\sigma]$ of signals
contains a unique signal $\sigma_0$ having no idempotent events. We say that signal $\sigma$ switches to s at
time t iff event $(s,t) \in \sigma_0$.

The state of signal $\sigma$ at time $t \in \mathbb{R}^+_0$, denoted by $\sigma(t)$, is given by the state of the event with
the maximum time not greater than t.^5 Because of (i), (ii) and (iii), $\sigma(t)$ is well defined for each
time $t \in \mathbb{R}^+_0$. Note that $\sigma$'s state function in fact depends on $[\sigma]$ only, i.e., we may add or remove
idempotent events at will without changing the state function.
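
The signal definitions above can be made concrete with a small sketch. It represents a finite prefix of a signal as a list of (state, time) pairs, removes idempotent events to obtain the canonical representative, and evaluates the state function. The names `canonical` and `state_at` and the example trace are illustrative assumptions, not notation from the paper, and the input is assumed to satisfy conditions (i) and (ii).

```python
from bisect import bisect_right
from typing import List, Tuple

Event = Tuple[str, float]  # (state, time)


def canonical(events: List[Event]) -> List[Event]:
    """Drop idempotent events, i.e., events repeating the preceding state."""
    out: List[Event] = []
    for state, time in sorted(events, key=lambda e: e[1]):
        if not out or out[-1][0] != state:
            out.append((state, time))
    return out


def state_at(events: List[Event], t: float) -> str:
    """State of the event with the maximum time not greater than t."""
    ordered = sorted(events, key=lambda e: e[1])
    times = [time for _, time in ordered]
    idx = bisect_right(times, t) - 1
    if idx < 0:
        raise ValueError("condition (iii) requires an event at time 0")
    return ordered[idx][0]


# Hypothetical trace: the events at times 1.0 and 3.0 are idempotent.
sigma = [("a", 0.0), ("a", 1.0), ("b", 2.5), ("b", 3.0), ("a", 4.0)]
sigma_0 = canonical(sigma)
```

Evaluating `state_at` on `sigma` and on `sigma_0` gives the same result at every time, which mirrors the remark that the state function depends only on the equivalence class.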

Distributed System On the topmost level of abstraction, we see the system as a set $V =
\{1, \dots, n\}$ of physically remote nodes that communicate by means of channels. In the context of a
VLSI circuit, "physically remote" actually refers to quite small distances (centimeters or even less). 
However, at gigahertz frequencies, a local state transition will not be observed remotely within 
a time that is negligible compared to clock speeds. We stress this point, since it is crucial that 
different clocks (and their attached logic) are not too close to each other, as otherwise they might 
fail due to the same event such as a particle hit. This would render it pointless to devise a system 
that is resilient to a certain fraction of the nodes failing. 

Each node i comprises a number of input ports, namely $S_{i,j}$ for each node j, an output port
$S_i$, and a set of local ports, introduced later on. An execution of the distributed system assigns to
each port of each node a signal. For convenience of notation, for any port p, we refer to the signal
assigned to port p simply by signal p. We say that node i is in state s at time t iff $S_i(t) = s$. We
further say that node i switches to state s at time t iff signal $S_i$ switches to s at time t.

Nodes exchange their states via the channels between them: for each pair of nodes i, j, output
port $S_i$ is connected to input port $S_{j,i}$ by a FIFO channel from i to j. Note that this includes a
channel from i to i itself. Intuitively, $S_i$ being connected to $S_{j,i}$ by a (non-faulty) channel means
that $S_{j,i}(\cdot)$ should mimic $S_i(\cdot)$, however, with a slight delay accounting for the time it takes the
signal to propagate. In contrast to an asynchronous system, this delay is bounded by the maximum
delay $d > 0$.^6

Formally we define: The channel from node i to j is said to be correct during $[t^-,t^+]$ iff
there exists a function $\tau_{i,j} : \mathbb{R}^+_0 \to \mathbb{R}^+_0$, called the channel's delay function, such that: (i) $\tau_{i,j}$ is
continuous and strictly increasing, (ii) $\forall t \in [t^-,t^+] : 0 < \tau_{i,j}(t) - t < d$, and (iii) for each $t \in [t^-,t^+]$,
$(s, \tau_{i,j}(t)) \in S_{j,i} \Leftrightarrow (s,t) \in S_i$. We say that node i observes node j in state s at time t if $S_{i,j}(t) = s$.
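
A minimal sketch of a correct channel, under the assumption of a constant propagation delay: the input signal is obtained by shifting every event of the output signal through the delay function. The names `transmit`, `delay_ok`, and `tau`, as well as the state names in the example, are hypothetical. Continuity of the delay function cannot be verified by sampling, so only monotonicity and the delay bound are spot-checked.

```python
from typing import Callable, List, Sequence, Tuple

Event = Tuple[str, float]  # (state, time)


def transmit(events: List[Event], tau: Callable[[float], float]) -> List[Event]:
    """Condition (iii): (s, tau(t)) is a received event exactly when (s, t) was sent."""
    return [(state, tau(time)) for state, time in events]


def delay_ok(tau: Callable[[float], float], times: Sequence[float], d: float) -> bool:
    """Spot-check (i) strict monotonicity and (ii) the delay bound on sampled times."""
    increasing = all(tau(a) < tau(b) for a, b in zip(times, times[1:]))
    bounded = all(0 < tau(t) - t < d for t in times)
    return increasing and bounded


def tau(t: float) -> float:
    return t + 0.5  # hypothetical constant propagation delay


sent = [("ready", 0.0), ("accept", 1.0)]
received = transmit(sent, tau)
```

Because the delay function is strictly increasing, the order of events is preserved, which is what makes the channel FIFO.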

Clocks and Timeouts Nodes are never aware of the current reference time and we also do not 
require the reference time to resemble Newtonian "real" time. Rather we allow for physical clocks 
that run arbitrarily fast or slow, as long as their speeds are close to each other in comparison. One 
may hence think of the reference time as progressing at the speed of the currently slowest correct 
clock. In this framework, nodes essentially make use of bounded clocks with bounded drift. 

^5 To facilitate intuition, we here slightly abuse notation, as this way $\sigma$ denotes both a function of time and the
signal (trace), which is a subset of $\mathbb{S} \times \mathbb{R}^+_0$. Whenever referring to $\sigma$, we will talk of the signal, not the state function.

^6 With respect to O-notation, we normalize $d \in O(1)$, as all time bounds simply depend linearly on d.

Formally, clock rates are within $[1, \vartheta]$ (with respect to reference time), where $\vartheta > 1$ is constant
and $\vartheta - 1$ is the (maximum) clock drift. A clock C is a continuous, strictly increasing function
$C : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ mapping reference time to some local time. Clock C is said to be correct during
$[t^-,t^+] \subseteq \mathbb{R}^+_0$ iff we have for any $t, t' \in [t^-,t^+]$, $t < t'$, that $t' - t \leq C(t') - C(t) \leq \vartheta (t' - t)$.
Each node comprises a set of clocks assigned to it, which allow the node to estimate the progress
of reference time.
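
The drift condition can be spot-checked on sampled reference times, as in the following sketch. The helper `clock_ok`, the tolerance parameter, and the linear example clocks are assumptions made for illustration. Checking consecutive samples suffices for all sampled pairs because both inequalities are additive over adjacent intervals.

```python
from typing import Callable, Sequence


def clock_ok(
    clock: Callable[[float], float],
    times: Sequence[float],
    theta: float,
    tol: float = 1e-12,
) -> bool:
    """Check t' - t <= C(t') - C(t) <= theta * (t' - t) on consecutive samples."""
    return all(
        (b - a) - tol <= clock(b) - clock(a) <= theta * (b - a) + tol
        for a, b in zip(times, times[1:])
    )


samples = [0.0, 0.5, 1.0, 2.0]
within_drift = clock_ok(lambda t: 1.05 * t, samples, theta=1.1)  # rate 1.05
too_fast = clock_ok(lambda t: 1.2 * t, samples, theta=1.1)       # rate 1.2
too_slow = clock_ok(lambda t: 0.9 * t, samples, theta=1.1)       # rate 0.9
```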

Instead of directly accessing the value of their clocks, nodes have access to so-called timeout
ports of watchdog timers. A timeout is a triple $(T,s,C)$, where $T \in \mathbb{R}^+$ is a duration, $s \in \mathbb{S}$ is a
state, and C is a clock, say of node i. Each timeout $(T,s,C)$ has a corresponding timeout port
$\mathrm{Time}_{T,s,C}$, being part of node i's local ports. Signal $\mathrm{Time}_{T,s,C}$ is Boolean, that is, its possible states
are from the set $\{0,1\}$. We say that timeout $(T,s,C)$ is correct during $[t^-,t^+] \subseteq \mathbb{R}^+_0$ iff clock C is
correct during $[t^-,t^+]$ and the following holds:

1. For each time $t_s \in [t^-,t^+]$ when node i switches to state s, there is a time $t \in [t_s, \tau_{i,i}(t_s)]$
such that $(T,s,C)$ is reset, i.e., $(0,t) \in \mathrm{Time}_{T,s,C}$. This is a one-to-one correspondence, i.e.,
$(T,s,C)$ is not reset at any other times.

2. For a time $t \in [t^-,t^+]$, denote by $t_0$ the supremum of all times from $[t^-,t]$ when $(T,s,C)$ is
reset. Then it holds that $(1,t) \in \mathrm{Time}_{T,s,C}$ iff $C(t) - C(t_0) = T$. Again, this is a one-to-one
correspondence.

We say that timeout $(T,s,C)$ expires at time t iff $\mathrm{Time}_{T,s,C}$ switches to 1 at time t, and it is
expired at time t iff $\mathrm{Time}_{T,s,C}(t) = 1$. For notational convenience, we will omit the clock C and
simply write $(T,s)$ for both the timeout and its signal.
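
A watchdog timeout of this kind can be sketched as follows. The class name `Timeout`, its polling interface, and the example clock rate are assumptions for illustration; a hardware implementation would be event-driven rather than polled, and the initial value of the signal would be arbitrary rather than 0.

```python
from typing import Callable, Optional


class Timeout:
    """Boolean timeout signal driven by a local clock C (sketch, polled)."""

    def __init__(self, duration: float, clock: Callable[[float], float]) -> None:
        self.duration = duration          # T, measured in local time
        self.clock = clock                # C, maps reference time to local time
        self.local_at_reset: Optional[float] = None  # C(t_0)
        self.signal = 0

    def reset(self, t: float) -> None:
        """Reset at reference time t: the signal switches to 0."""
        self.local_at_reset = self.clock(t)
        self.signal = 0

    def poll(self, t: float) -> int:
        """Return the signal at reference time t; it is 1 once C(t) - C(t_0) >= T."""
        if self.signal == 0 and self.local_at_reset is not None:
            if self.clock(t) - self.local_at_reset >= self.duration:
                self.signal = 1
        return self.signal


# Hypothetical clock running at rate 1.25: T = 5 local units take 4 reference units.
timer = Timeout(5.0, lambda t: 1.25 * t)
timer.reset(2.0)
before = timer.poll(5.9)
after = timer.poll(6.0)
```

With a clock rate between 1 and $\vartheta$, a timeout of duration T expires between $T/\vartheta$ and T reference time units after its reset, which is why the analysis can treat timeouts as bounded-uncertainty delays.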

A randomized timeout is a triple $(\mathcal{D},s,C)$, where $\mathcal{D}$ is a bounded random distribution on $\mathbb{R}^+_0$,
$s \in \mathbb{S}$ is a state, and C is a clock. Its corresponding timeout port $\mathrm{Time}_{\mathcal{D},s,C}$ behaves very similarly to
the one of an ordinary timeout, except that whenever it is reset, the local time that passes until it
expires next (provided that it is not reset again before that happens) follows the distribution $\mathcal{D}$.
Formally, $(\mathcal{D},s,C)$ is correct during $[t^-,t^+] \subseteq \mathbb{R}^+_0$ if C is correct during $[t^-,t^+]$ and the following
holds:

1. For each time $t_s \in [t^-,t^+]$ when node i switches to state s, there is a time $t \in [t_s, \tau_{i,i}(t_s)]$
such that $(\mathcal{D},s,C)$ is reset, i.e., $(0,t) \in \mathrm{Time}_{\mathcal{D},s,C}$. This is a one-to-one correspondence, i.e.,
$(\mathcal{D},s,C)$ is not reset at any other times.

2. For a time $t \in [t^-,t^+]$, denote by $t_0$ the supremum of all times from $[t^-,t]$ when $(\mathcal{D},s,C)$ is
reset. Let $\mu : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ denote the density of $\mathcal{D}$. Then $(1,t) \in \mathrm{Time}_{\mathcal{D},s,C}$ "with probability
$\mu(C(t) - C(t_0))$" and we require that the probability of $(1,t) \in \mathrm{Time}_{\mathcal{D},s,C}$, conditional to
$t_0$ and C on $[t_0,t]$ being given, is independent of the system's state at times smaller than t.
More precisely, if superscript $\mathcal{E}$ identifies variables in execution $\mathcal{E}$ and $t_0'$ is the infimum of all
times from $(t_0,t^+]$ when node i switches to state s, then we demand for any $[\tau^-,\tau^+] \subseteq [t_0,t_0']$
that

$$P\left[\exists t' \in [\tau^-,\tau^+] : (1,t') \in \mathrm{Time}^{\mathcal{E}}_{\mathcal{D},s,C} \,\middle|\, t_0^{\mathcal{E}} = t_0 \wedge C^{\mathcal{E}}|_{[t_0,\tau^+]} = C|_{[t_0,\tau^+]}\right] = \int_{\tau^-}^{\tau^+} \mu(C(\tau) - C(t_0)) \, d\tau,$$

independently of $\mathcal{E}|_{[0,\tau^-)}$.

We will apply the same notational conventions to randomized timeouts as we do for regular 
timeouts. 
Note that, strictly speaking, this definition does not induce a random variable describing the
time $t' \in [t_0,t_0')$ satisfying that $(1,t') \in \mathrm{Time}_{\mathcal{D},s,C}$. However, for the state of the timeout port, we
get the meaningful statement that for any $t' \in [t_0,t_0')$,

$$P\left[\mathrm{Time}_{\mathcal{D},s,C} \text{ switches to 1 during } [t_0,t']\right] = \int_{t_0}^{t'} \mu(C(\tau) - C(t_0)) \, d\tau.$$

The reason for phrasing the definition in the above more cumbersome way is that we want to
guarantee that an adversary knowing the full present state of the system and memorizing its whole
history cannot reliably predict when the timeout will expire.^7

We remark that these definitions allow for different timeouts to be driven by the same clock, 
implying that an adversary may derive some information on the state of a randomized timeout 
before it expires from the node's behaviour, even if it cannot directly access the values of the clock 
driving the timeout. This is crucial for implementability, as it might be very difficult to guarantee
that the behaviour of a dedicated clock that drives a randomized timeout is indeed independent of 
the execution of the algorithm. 
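
One way to picture a timeout whose expiry cannot be predicted from the current state is to decide step by step whether it fires, instead of drawing the expiry time at reset. The sketch below does this for a uniform distribution on a hypothetical interval [lo, hi] of local time, using a discretization step; the function name, the uniform choice, and the discretization are all assumptions made for illustration and are not the construction used in the paper.

```python
import random


def sample_expiry_stepwise(lo: float, hi: float, step: float, rng: random.Random) -> float:
    """Return the local elapsed time at which the timeout expires.

    No expiry time is stored in advance. In every step the timeout fires with
    the conditional probability that a uniform [lo, hi] duration ends within
    this step, given that it has not ended before.
    """
    x = lo
    while x + step < hi:
        if rng.random() < step / (hi - x):
            return x + step
        x += step
    return hi  # the distribution is bounded, so the timeout fires by hi at the latest


rng = random.Random(1)
samples = [sample_expiry_stepwise(1.0, 2.0, 0.01, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
```

The resulting expiry times are (up to discretization) uniformly distributed, yet at any moment before expiry the state of the sketch contains no information about when it will fire beyond the elapsed time.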

Memory Flags Besides timeout and randomized timeout ports, another kind of node i's local
ports are memory flags. For each state $s \in \mathbb{S}$ and each node $j \in V$, $\mathrm{Mem}_{i,j,s}$ is a local port of
node i. It is used to memorize whether node i has observed node j in state s since the last reset of
the flag. We say that node i memorizes node j in state s at time t if $\mathrm{Mem}_{i,j,s}(t) = 1$. Formally, we
require that signal $\mathrm{Mem}_{i,j,s}$ switches to 1 at time t iff node i observes node j in state s at time t
and $\mathrm{Mem}_{i,j,s}$ is not already in state 1. The times t when $\mathrm{Mem}_{i,j,s}$ is reset, i.e., $(0,t) \in \mathrm{Mem}_{i,j,s}$,
are specified by node i's state machine, which is introduced next.
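
The behaviour of a single memory flag can be sketched as a set-dominant latch. The class `MemoryFlag` and its method names are hypothetical; the state name "sleep" in the example is made up, while "accept" is the pulse state used later in the paper.

```python
class MemoryFlag:
    """Remembers whether a watched state has been observed since the last reset."""

    def __init__(self, watched_state: str) -> None:
        self.watched_state = watched_state
        self.value = 0

    def observe(self, observed_state: str) -> None:
        # Switch to 1 when the watched state is observed; otherwise keep the value.
        if observed_state == self.watched_state:
            self.value = 1

    def reset(self) -> None:
        # Resets are triggered by the node's state machine.
        self.value = 0


flag = MemoryFlag("accept")
flag.observe("sleep")
seen_before = flag.value
flag.observe("accept")
flag.observe("sleep")  # the flag stays set after the remote node moves on
seen_after = flag.value
```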

State Machine It remains to specify how nodes switch states and when they reset memory flags. 
We do this by means of state machines that may attain states from the finite alphabet $\mathbb{S}$. A node's
state machine is specified by (i) the set $\mathbb{S}$, (ii) a function tr, called the transition function, from
$T \subseteq \mathbb{S}^2$ to the set of Boolean predicates on the alphabet consisting of expressions "p = s" (used
for expressing guards), where p is from the node's input and local ports and s is from the set of 
possible states of signal p, and (iii) a function re, called the reset function, from T to the power 
set of the node's memory flags. 

Intuitively, the transition function specifies the conditions (guards) under which a node switches 
states, and the reset function determines which memory flags to reset upon the state change. 
Formally, let P be a predicate on node i's input and local ports. We define "P holds at time t" by
structural induction: If P is equal to "p = s", where p is one of node i's input and local ports and
s is one of the states signal p can attain, then P holds at time t iff $p(t) = s$. Otherwise, if P is of
the form $\neg P_1$, $P_1 \wedge P_2$, or $P_1 \vee P_2$, we define "P holds at time t" in the straightforward manner.
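
The structural induction can be written down directly as a recursive evaluator. The following sketch assumes that the current state of every port is available in a dictionary; the class names, the port labels, and the state name "ready" are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class Eq:
    """Atomic expression "p = s"."""
    port: str
    state: Any


@dataclass(frozen=True)
class Not:
    arg: "Pred"


@dataclass(frozen=True)
class And:
    left: "Pred"
    right: "Pred"


@dataclass(frozen=True)
class Or:
    left: "Pred"
    right: "Pred"


Pred = Union[Eq, Not, And, Or]


def holds(pred: Pred, ports: Dict[str, Any]) -> bool:
    """Evaluate a guard against the port states at one instant of time."""
    if isinstance(pred, Eq):
        return ports[pred.port] == pred.state
    if isinstance(pred, Not):
        return not holds(pred.arg, ports)
    if isinstance(pred, And):
        return holds(pred.left, ports) and holds(pred.right, ports)
    if isinstance(pred, Or):
        return holds(pred.left, ports) or holds(pred.right, ports)
    raise TypeError(f"unknown predicate: {pred!r}")


# Hypothetical guard: own state observed as "ready" and (timeout expired or flag set).
ports = {"S_i,i": "ready", "Time_T,ready": 1, "Mem_i,j,accept": 0}
guard = And(Eq("S_i,i", "ready"), Or(Eq("Time_T,ready", 1), Eq("Mem_i,j,accept", 1)))
result = holds(guard, ports)
```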

We say node i follows its state machine during $[t^-,t^+]$ iff the following holds: Assume node i
observes itself in state $s \in \mathbb{S}$ at time $t \in [t^-,t^+]$, i.e., $S_{i,i}(t) = s$. Then, for each $(s,s') \in T$, both:

1. Node i switches to state s' at time t iff $tr(s,s')$ holds at time t and i is not already in state s'.^8

2. Node i resets memory flag m at some time in the interval $[t, \tau_{i,i}(t)]$ iff $m \in re(s,s')$ and i
switches from state s to state s' at time t. This correspondence is one-to-one.

^7 This is a non-trivial property. For instance, nodes could just determine, by drawing from the desired random
distribution at time $t_0$, at which local clock value the timeout shall expire next. This would, however, essentially give
away early when the timeout will expire, greatly reducing the power of randomization!

^8 In case more than one guard $tr(s,s')$ can be true at the same time, we assume that an arbitrary tie-breaking
ordering exists among the transition guards that specifies to which state to switch.



A node is defined to be non-faulty during $[t^-,t^+]$ iff during $[t^-,t^+]$ all its timeouts and randomized
timeouts are correct and it follows its state machine. If it employs multiple state machines
(see below), it needs to follow all of them.

In contrast, a faulty node may change states arbitrarily. Note that while a faulty node may be
forced to send consistent output state signals to all other nodes if its channels remain correct, there
is no way to guarantee that this still holds true if channels are faulty.^9



Metastability In our discrete system model, the effect of metastability is captured by the inability
of state machines to instantaneously take on new states: Node i decides on state
transitions based on the delayed status of port $S_{i,i}$ instead of its "true" current state $S_i$. This
non-zero delay from $S_i$ to $S_{i,i}$ bears the potential for metastability, as a successful state transition
can only be guaranteed if, after a transition guard from some state s to some state s' becomes true,
all other transition guards from s to $s'' \neq s'$ remain false during this delay at least.

This is exemplified in the following scenario: Assume node i is in state s at some time t.
However, since it switched to s only very recently, it still observes itself in state $s' \neq s$ at time t
via $S_{i,i}$. Given that there is a transition $(s',s'')$ in T, $s'' \neq s$, whose condition is fulfilled at time
t, it will switch to state s'' at time t (although state s has not even stabilized yet). That is, due
to the discrepancy between $S_{i,i}$ and $S_i$, node i switches from state s to state s'' at time t even if
$(s,s'')$ is not in T at all.^10 In a physical chip design, this premature change of state might even
result in inconsistent operations on the local memory, up to the point where it cannot be properly
described in terms of $\mathbb{S}$, and thus in terms of our discrete model, anymore. Even worse, the state
of i is part of the local memory and the node's state signal may attain an undefined value that is
propagated to other nodes and their memory. While avoiding the latter is the task of the input
ports of a non-faulty node, our goal is to prevent this erroneous behaviour in situations where input
ports attain legitimate values only.

Therefore, we define node i to be metastability-free, if the situation described above does not 
occur. 

Definition 2.1 (Metastability-Freedom). Node $i \in V$ is called metastability-free during $[t^-,t^+]$
iff for each time $t \in [t^-,t^+]$ when i switches to some state $s \in \mathbb{S}$, it holds that $\tau_{i,i}(t) < t'$, where t'
is the infimum of all times in $(t,t^+]$ when i switches to some state $s' \in \mathbb{S}$.
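
For a finite list of switching times, the condition of the definition can be checked as in the sketch below: every switch must have become visible on the node's own feedback channel before the next switch happens. The function name and the constant feedback delay are assumptions for illustration.

```python
from typing import Callable, Sequence


def metastability_free(switch_times: Sequence[float], tau_ii: Callable[[float], float]) -> bool:
    """True iff each switch at time t is followed by the next one only after tau_ii(t)."""
    ordered = sorted(switch_times)
    return all(tau_ii(t) < nxt for t, nxt in zip(ordered, ordered[1:]))


def tau_ii(t: float) -> float:
    return t + 0.2  # hypothetical delay of the channel from node i to itself


well_separated = metastability_free([0.0, 1.0, 2.5], tau_ii)
too_close = metastability_free([0.0, 0.1, 2.0], tau_ii)
```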



Multiple State Machines In some situations the previous definitions are too stringent, as there
might be different "components" of a node's state machine that act concurrently and independently,
mostly relying on signals from disjoint input ports or orthogonal components of a signal. We model
this by permitting that nodes run several state machines in parallel. All these state machines share
the input and local ports of the respective node and are required to have disjoint state spaces. If
node i runs state machines $M_1, \dots, M_k$, node i's output signal is the product of the output signals
of the individual machines. Formally we define: Each of the state machines $M_j$, $1 \leq j \leq k$, has
an additional own output port $s_j$. The state of node i's output port $S_i$ at any time t is given
by $S_i(t) := (s_1(t), \dots, s_k(t))$, where the signals of ports $s_1, \dots, s_k$ are each defined analogously to the
signals of the output ports of state machines in the single state machine case. Note that by
this definition, the only (local) means for node i's state machines to interact with each other is by
reading the delayed state signal $S_{i,i}$.

^9 A single physical fault may cause this behaviour, as at some point a node's output port must be connected to
remote nodes' input ports. Even if one places bifurcations at different physical locations striving to mitigate this
effect, if the voltage at the output port drops below specifications, the values of corresponding input channels may
deviate in unpredictable ways.

^10 Note that while the "internal" delay $\tau_{i,i}(t) - t$ can be made quite small, it cannot be reduced to zero if the model
is meant to reflect physical implementations.

We say that node i's state machine $M_j$ is in state s at time t iff $s_j(t) = s$, where $S_i(t) =
(s_1(t), \dots, s_k(t))$, and that node i's state machine $M_j$ switches to state s at time t iff signal $s_j$
switches to s at time t. Since the state spaces of the machines $M_j$ are disjoint, we will omit the
phrase "state machine $M_j$" from the notation, i.e., we write "node i is in state s" or "node i
switched to state s", respectively.

Recall that the various state machines of node i are as loosely coupled as remote nodes, namely 
via the delayed status signal on channel Si^i only. Therefore, it makes sense to consider them 
independently also when it comes to metastability. 

Definition 2.2 (Metastability-Freedom (Multiple State Machines)). State machine M of node
$i \in V$ is called metastability-free during $[t^-,t^+]$ iff for each time $t \in [t^-,t^+]$ when M switches to
some state $s \in \mathbb{S}$, it holds that $\tau_{i,i}(t) < t'$, where t' is the infimum of all times in $(t,t^+]$ when M
switches to some state $s' \in \mathbb{S}$.

Note that by this definition the different state machines may switch states concurrently without
suffering from metastability.^11 It is even possible that some state machine suffers metastability,
while another is not affected by this at all.^12

Problem Statement The purpose of the pulse synchronization protocol is that nodes generate 
synchronized, well-separated pulses by switching to a distinguished state accept. Self-stabilization 
requires that they start to do so within bounded time, for any possible initial state. However, as 
our protocol makes use of randomization, there are executions where this does not happen at all; 
instead, we will show that the protocol stabilizes with probability one in finite time. To give a 
precise meaning to this statement, we need to define appropriate probability spaces. 

Definition 2.3 (Adversarial Spaces). Denote for $i \in V$ by $\mathcal{C}_i = \{C_{i,k} \mid k \in \{1,\dots,c_i\}\}$ the set
of clocks of node i. An adversarial space is a probability space that is defined by subsets of
nodes and channels $W \subseteq V$ and $E \subseteq V \times V$, a time interval $[t^-,t^+]$, a protocol $\mathcal{P}$ (nodes' ports,
state machines, etc.) as previously defined, sets of clock and delay functions $\mathcal{C} = \bigcup_{i \in V} \mathcal{C}_i$ and
$\Theta = \{\tau_{i,j} : \mathbb{R}^+_0 \to \mathbb{R}^+_0 \mid (i,j) \in V^2\}$, an initial state $\mathcal{E}_0$ of all ports, and an adversarial function $\mathcal{A}$.
Here $\mathcal{A}$ is a function that maps a partial execution $\mathcal{E}|_{[0,t]}$ until time t (i.e., all ports' values until
time t), W, E, $[t^-,t^+]$, $\mathcal{P}$, $\mathcal{C}$, and $\Theta$ to the states of all faulty ports during the time interval $(t,t']$,
where t' is the infimum of all times greater than t when a non-faulty node or channel switches
states.



^11 However, care has to be taken when implementing the inter-node communication of the state components in a
metastability-free manner, cf. Section 6.

^12 This is crucial for the algorithm we are going to present. For stabilization purposes, nodes comprise a state
machine that is prone to metastability. However, the state machine generating pulses (i.e., having the state accept,
cf. Definition 2.4) does not take its output signal into account once stabilization is achieved. Thus, the algorithm is
metastability-free after stabilization in the sense that we guarantee a metastability-free signal indicating when pulses
occur.
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The adversarial space AS{W, E, [t , t^], "P, C,Q,£q,A) is now defined on the set of all executions 
E satisfying that (i) the initial state of all ports is given by <?|[o,o] = ^o, (H) for all i £ V and 
k E {l,...,Ci} : Cfj^ = Ci^k, i'i'ii) for all G V'^, rfj = nj, [iv) nodes in W are non-faulty 
during [t~ ,t~^] with respect to the protocol V , {v) all channels in E are correct during [t~ ^f^], and 
{vi) given <?|[o,t] for any time t, E\{t,t'] given by A, where t' is the infimum of times greater than 
t when a non-faulty node switches states. Thus, except for when randomized timeouts expire, E is 



fully predetermined by the parameters of AS The probability measure on AS is induced by the 



random distributions of the randomized timeouts specified by V . 

To avoid confusion, observe that if the clock functions and delays do not follow the model
constraints during $[t^-, t^+]$, the respective adversarial space is empty and thus of no concern. This
cumbersome definition provides the means to formalize a notion of stabilization that accounts for
worst-case drifts and delays and an adversary that knows the full state of the system up to the
current time.

We are now in the position to formally state the pulse synchronization problem in our framework.
Intuitively, the goal is that after transient faults cease, nodes should with probability one eventually
start to issue well-separated, synchronized pulses by switching to a dedicated state accept. Thus,
as the initial state of the system is arbitrary, specifying an algorithm§ is equivalent to defining the
state machines that run at each node, one of which has a state accept.

Definition 2.4 (Self-Stabilizing Pulse Synchronization). Given a set of nodes $W \subseteq V$ and a set
$E \subseteq V \times V$ of channels, we say that protocol $\mathcal{P}$ is a $(W, E)$-stabilizing pulse synchronization protocol
with skew $\Sigma$ and accuracy bounds $T^-$, $T^+$ that stabilizes within time $T$ with probability $p$ iff the
following holds. Choose any time interval $[t^-, t^+] \supseteq [t^-, t^- + T + \Sigma]$ and any adversarial space
$AS(W, E, [t^-, t^+], \mathcal{P}, \cdot, \cdot, \cdot, \cdot)$ (i.e., $C$, $\Theta$, $\mathcal{E}_0$, and $A$ are arbitrary). Then executions from $AS$ satisfy
with probability at least $p$ that there exists a time $t_s \in [t^-, t^- + T]$ so that, denoting by $t_i(k)$ the
time when node $i$ switches to a distinguished state accept for the $k$-th time after $t_s$ ($t_i(k) = \infty$ if no
such time exists), (i) $t_i(1) \in (t_s, t_s + \Sigma)$, (ii) $|t_i(k) - t_j(k)| \le \Sigma$ if $\max\{t_i(k), t_j(k)\} \le t^+$, and (iii)
$T^- \le |t_i(k+1) - t_i(k)| \le T^+$ if $t_i(k) + T^+ \le t^+$.
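
To make conditions (ii) and (iii) concrete, the following Python sketch checks them on a finite, recorded trace of accept times. The function name `check_pulse_sync`, the trace format, and the example numbers are hypothetical and only serve as illustration; condition (i) and the horizon $t^+$ are ignored, i.e., we assume that all recorded pulses lie within the considered interval.

```python
def check_pulse_sync(accept_times, skew, t_min, t_max):
    """Check conditions (ii) and (iii) of Definition 2.4 on a finite trace.

    accept_times maps each node to the sorted list of times at which it
    switched to accept after t_s (a hypothetical trace format).
    """
    # Only pulses that every node has already generated can be compared.
    rounds = min(len(times) for times in accept_times.values())

    # (ii) the k-th pulses of any two nodes are at most `skew` apart.
    for k in range(rounds):
        kth = [times[k] for times in accept_times.values()]
        if max(kth) - min(kth) > skew:
            return False

    # (iii) consecutive pulses of a node are separated by a gap
    # within [t_min, t_max].
    for times in accept_times.values():
        for earlier, later in zip(times, times[1:]):
            if not t_min <= later - earlier <= t_max:
                return False
    return True


# Illustrative numbers only: two nodes, three pulses each.
trace = {"a": [0.0, 10.0, 20.1], "b": [0.5, 10.4, 20.0]}
print(check_pulse_sync(trace, skew=1.0, t_min=9.0, t_max=11.0))
print(check_pulse_sync(trace, skew=0.3, t_min=9.0, t_max=11.0))
```

The second call fails because the first pulses of the two nodes are 0.5 time units apart, which exceeds the requested skew of 0.3.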

Note that the fact that $A$ is a deterministic function and, more generally, that we consider each
space $AS$ individually, is no restriction: As $\mathcal{P}$ succeeds for any adversarial space with probability
at least $p$ in achieving stabilization, the same holds true for randomized adversarial strategies $A$
and worst-case drifts and delays.



3 The FATAL Pulse Synchronization Protocol 

In this section, we present our self-stabilizing pulse generation algorithm. In order to be suitable 
for implementation in hardware, it needs to utilize very simple rules only. It is stated in terms of 
a state machine as introduced in the previous section. 

Since the ultimate goal of the pulse generation algorithm is to stabilize a system of DARTS
clocks, we introduce an additional port $\mathrm{DARTS}_i$ for each node $i$, which is driven by node $i$'s DARTS
instance. As for other state signals, its output raises the flag $Mem_{i,\mathrm{DARTS},i}$, to which for simplicity
we refer as $\mathrm{DARTS}_i$ as well. Note that the DARTS signals are of no concern to the liveness or
stabilization of the pulse algorithm itself; rather, it is a control signal from the DARTS component
that helps in adjusting the frequency of pulses to the speed of the DARTS clocks once the system as a
whole (including the DARTS component) is stable. The pulse algorithm will stabilize independently
of the DARTS signal, and the DARTS component will stabilize once the pulse component did so.
Therefore we can partition the algorithm's analysis into two parts. When proving the correctness
of the algorithm in Section 4, we assume that for each node $i$, $\mathrm{DARTS}_i$ is arbitrary. In Section 7, we
will outline how the pulse algorithm and DARTS interact.

‡ This follows by induction starting from the initial configuration $\mathcal{E}_0$. Using $A$, we can always extend $\mathcal{E}$ to the next
time when a correct node switches states, and when correct nodes switch states is fully determined by the parameters
of $AS$ except for when randomized timeouts expire. Note that the induction reaches any finite time within a finite
number of steps, as signals switch states finitely often in finite time.

§ We use the terms "algorithm" and "protocol" interchangeably throughout this work.

Figure 1: Basic cycle of node $i$ once the algorithm has stabilized.



3.1 Basic Cycle 

The full algorithm makes use of a rather involved interplay between conditions on timeouts, states,
and thresholds to converge to a safe state despite a limited number of faulty components. As our
approach is thus difficult to present in bulk, we break it down into pieces. Moreover, to facilitate
giving intuition about the key ideas of the algorithm, in this section we assume that there are
$f < n/3$ faulty nodes, and the remaining $n - f$ nodes are non-faulty within $[0, \infty)$ (where of course
the time is unknown to the nodes). We further assume that channels between non-faulty nodes
(including loopback channels) are correct within $[0, \infty)$. We start by presenting the basic cycle that
is repeated every pulse once a safe configuration is reached (see Figure 1).

We employ graphical representations of the state machine of each node $i \in V$. States are
represented by circles containing their names, while a transition $(s, s') \in T$ is depicted as an arrow
from $s$ to $s'$. The guard $tr(s, s')$ is written as a label next to the arrow, and the reset function's
value $re(s, s')$ is depicted in a rectangular box on the arrow. To keep labels simple, we make
use of some abbreviations. We write $T$ instead of $(T, s)$ if $s$ is the state which node $i$ leaves if
the condition involving $(T, s)$ is satisfied. Threshold conditions like "$\ge f + 1$ $s$", where $s \in S$,
abbreviate Boolean predicates that reach over all of node $i$'s memory flags $Mem_{i,j,s}$, where $j \in V$,
and are defined in a straightforward manner. If in such an expression we connect two states by
"or", e.g., "$\ge n - f$ $s$ or $s'$" for $s, s' \in S$, the summation considers flags of both types $s$ and $s'$.
Thus, such an expression is equivalent to $\sum_{j \in V} \max\{Mem_{i,j,s}, Mem_{i,j,s'}\} \ge n - f$. For any state
$s \in S$, the condition $S_{i,j} = s$ (respectively, $\neg(S_{i,j} = s)$) is written in short as "$j$ in $s$" (respectively,
"$j$ not in $s$"). If $j = i$, we simply write "(not) in $s$". We write "true" instead of a condition
that is always true (like e.g. "(in $s$) or (not in $s$)" for an arbitrary state $s \in S$). Finally, $re(\cdot,\cdot)$
always requires to reset all memory flags of certain types, hence we write e.g. propose if all flags
$Mem_{i,j,propose}$ are to be reset.
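
The threshold abbreviations can be read as a count over senders. The following Python sketch evaluates such a guard at a single node; the function name `threshold_met` and the dictionary representation of the memory flags are our own illustrative assumptions, not part of the protocol description.

```python
def threshold_met(mem_flags, states, threshold):
    """Evaluate a guard such as ">= n - f  s or s'" at one node.

    mem_flags maps (j, s) to True if the node memorizes node j in state s
    (a hypothetical representation of the flags Mem_{i,j,s}).
    """
    senders = {j for (j, _state) in mem_flags}
    # Each sender counts at most once, even if flags of several of the
    # listed states are set for it (the max over the flag types).
    count = sum(
        1 for j in senders
        if any(mem_flags.get((j, s), False) for s in states)
    )
    return count >= threshold


# Illustrative values: n = 4 nodes, f = 1 tolerated fault.
n, f = 4, 1
flags = {
    (0, "propose"): True,
    (1, "propose"): True, (1, "accept"): True,
    (2, "propose"): False, (2, "accept"): False,
    (3, "accept"): True,
}
print(threshold_met(flags, ["propose", "accept"], n - f))
print(threshold_met(flags, ["propose"], n - f))
```

Node 1 has both flags set but contributes only once to the count, which mirrors the maximum in the sum above.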

We now briefly introduce the basic flow of the algorithm once it stabilizes, i.e., once all $n - f$
non-faulty nodes are well-synchronized. Recall that the remaining up to $f < n/3$ faulty nodes may
produce arbitrary signals on their outgoing channels. A pulse is locally triggered by switching to
state accept. Thus, assume that at some time all non-faulty nodes switch to state accept within
a time window of $2d$, i.e., a valid pulse is generated. Supposing that $T_1 \ge 3\vartheta d$, these nodes will
observe, and thus memorize, each other and themselves in state accept before $T_1$ expires. This makes
timeout $T_1$ the critical condition for switching to state sleep. From state sleep, they will switch
to states sleep → waking, waking, and finally ready, where the timeout $(T_2, accept)$ determines
the time this takes, as it is considerably larger than $(\ldots + 2)T_1$. The intermediate states serve
the purpose of achieving stabilization, hence we leave them out for the moment. Note that upon
switching to state ready, nodes reset their propose flags and $\mathrm{DARTS}_i$. Thus, they essentially ignore
these signals between the most recent time they switched to propose before switching to accept and
the subsequent time when they switch to ready. This ensures that nodes do not take into account
outdated information for the decision when to switch to state propose. Hence, it is guaranteed that
the first node switching from state ready to state propose again does so because $T_4$ expired or because
$T_3$ expired and its DARTS memory flag is true. Due to the constraint $\min\{T_3, T_4\} \ge \vartheta(T_2 + 4d)$, we
are sure that all non-faulty nodes observe themselves in state ready before the first one switches
to propose. Hence, no node deletes information about nodes that switch to propose again after the
previous pulse. The first non-faulty node that switches to state accept again cannot do so before
it memorizes at least $n - f$ nodes in state propose, as the accept flags are reset upon switching to
state propose. Therefore, at this time at least $n - 2f \ge f + 1$ non-faulty nodes are in state propose.
Hence, the rule that nodes switch to propose if they memorize $f + 1$ nodes in states propose will
take effect, i.e., the remaining non-faulty nodes in state ready switch to propose after less than $d$
time. Another $d$ time later all non-faulty nodes in state propose will have become aware of this
and switch to state accept as well, as the threshold of $n - f$ nodes in states propose or accept is
reached. Thus the cycle is complete and the reasoning can be repeated inductively.
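
The counting step in this argument, namely that $n - f$ memorized nodes contain at least $n - 2f \ge f + 1$ non-faulty ones whenever $f < n/3$, can be checked mechanically. The following Python sketch does so for small systems; the helper names are hypothetical and chosen for illustration only.

```python
def max_faults(n):
    """Largest integer f satisfying f < n/3."""
    return (n - 1) // 3


def quorum_triggers_followers(n, f):
    """True if n - f memorized nodes contain at least f + 1 non-faulty ones.

    At most f of the memorized nodes are faulty, so at least n - 2f are
    non-faulty; the f + 1 threshold for following is met iff n - 2f >= f + 1.
    """
    return n - 2 * f >= f + 1


for n in (4, 7, 10):
    f = max_faults(n)
    print(n, f, n - 2 * f, quorum_triggers_followers(n, f))
```

With $n = 6$ and $f = 2$, which violates $f < n/3$, the check fails, since then only $n - 2f = 2 < f + 1$ non-faulty nodes are guaranteed.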

Clearly, for this line of argumentation to be valid, the algorithm could be simpler than stated in
Figure 1. We already mentioned that the motivation of having three intermediate states between
accept and ready is to facilitate stabilization. Similarly, there is no need to make use of the accept
flags in the basic cycle at all; in fact, it adversely affects the constraints the timeouts need to
satisfy for the above reasoning to be valid. However, the accept flags are much better suited for
diagnostic purposes than the propose flags, since nodes are expected to switch to accept in a small
time window and remain in state accept for a small period of time only (for all our results, it is
sufficient if $T_1 = 4\vartheta d$). Moreover, two different timeout conditions for switching from ready to
propose are unnecessary for correct operation of the pulse synchronization routine. As discussed
before, they are introduced in order to allow for a seamless coupling to the DARTS system. We
elaborate on this in Section 7.

Figure 2: Overview of the core routine of node $i$'s self-stabilizing pulse algorithm.



3.2 Main Algorithm 

We proceed by describing the main routine of the pulse algorithm in full. Alongside the main 
routine, several other state machines run concurrently and provide additional information to be 
used during recovery. 



The main routine is graphically presented in Figure 2, together with a very simple second
component whose sole purpose is to simplify the otherwise overloaded description of the main
routine. Except for the states recover and join and additional resets of memory flags, the main
routine is identical to the basic cycle. The purpose of the two additional states is the following:
Nodes switch to state recover once they detect that something is wrong, that is, non-faulty nodes
do not execute the basic cycle as outlined in Section 3.1. This way, non-faulty nodes will not
continue to confuse others by sending, for example, state signals propose or accept despite clearly
being out-of-sync. There are various consistency checks that nodes perform during each execution
of the basic cycle. The first one is that in order to switch from state accept to state sleep, non-faulty
nodes need to memorize at least $n - f$ nodes in state accept. If this does not happen within $T_1$
time after switching to state accept, by the arguments given in Section 3.1 they could not have
entered state accept within $2d$ of each other. Therefore, something must be wrong and it is feasible
to switch to state recover. Next, whenever a non-faulty node is in state waking, there should be
no non-faulty nodes in states accept or recover. Considering that the node resets its accept and
recover flags upon switching to waking, it should not memorize $f + 1$ or more nodes in states accept
or recover at a time when it observes itself in state waking. If it does, however, it again switches
to state recover. Similarly, when in state ready, nodes expect others not to be in state accept for
more than a short period of time, as a non-faulty node switching to accept should imply that every
non-faulty node switches to propose and then to accept shortly thereafter. This is expressed by
the second state machine comprising two states only. If a node is in state ready and memorizes
$f + 1$ nodes in state accept, it switches to suspect. Subsequently, if it remains in state ready until
a timeout of $2\vartheta d$ expires, it will switch to state recover. Last but not least, during a synchronized
execution of the basic cycle, no non-faulty node may be in state propose for more than a certain
amount of time before switching to state accept. Therefore, nodes will switch from propose to
recover when timeout $T_5$ expires.

Nodes can join the basic cycle again via the second new state, called join. Since the Byzantine
nodes may "play nice" towards $n - 2f$ or more nodes still executing the basic cycle, making
them believe that system operation continues as usual, it must be possible to join the basic cycle
again without having a majority of nodes in state recover. On the other hand, it is crucial that
this happens in a sufficiently well-synchronized manner, as otherwise nodes could drop out again
because the various checks of consistency detect an erroneous execution of the basic cycle.

In part, this issue is solved by an additional agreement step. In order to enter the basic
cycle again, nodes need to memorize $n - f$ nodes in states join (the respective nodes detected an
inconsistency), propose (these nodes continued to execute the basic cycle), or accept (there are
executions where nodes reset their propose flags because of switching to join when other nodes
already switched to accept). Since there are thresholds of $f + 1$ nodes memorized in state join
both for leaving state recover and switching from ready to join, all nodes will follow the first
one switching from join to propose quickly, just as with the switch from propose to accept in an
ordinary execution of the basic cycle. However, it is decisive that all nodes are in states that permit
them to participate in this agreement step in order to guarantee success of this approach.

As a result, still a certain degree of synchronization needs to be established beforehand, both
among nodes that still execute the basic cycle and those that do not. For instance, if at the point
in time when a majority of nodes and channels become non-faulty, some nodes already memorize
nodes in join that are not, they may switch to state join and subsequently propose prematurely,
causing others to have inconsistent memory flags as well. Again, Byzantine faults may sustain this
amiss configuration of the system indefinitely.

So why did we put so much effort in "shifting" the focus to this part of the algorithm? The key 
advantage is that nodes outside the basic cycle may take into account less reliable information for 
stabilization purposes. They may take the risk of metastable upsets (as we know it is impossible 
to avoid these during the stabilization process, anyway) and make use of randomization. 

In fact, to make the above scheme work, it is sufficient that all non-faulty nodes agree on a
so-called resynchronization point (formally defined later on), that is, a point in time at which
nodes reset the memory flags for states join and sleep → waking as well as certain timeouts, while
guaranteeing that no node is in these states close to the respective reset times. Except for state
sleep → waking, all of these timeouts, memory flags, etc. are not part of the basic cycle at all, thus
nodes may enforce consistent values for them when they agree on such a resynchronization point.

Conveniently, the use of randomization also ensures that it is quite unlikely that nodes are
in state sleep → waking close to a resynchronization point, as the consistency check of having to
memorize $n - f$ nodes in state accept in order to switch to state sleep guarantees that the time
windows during which non-faulty nodes may switch to sleep make up a small fraction of all times
only.

Consequently, the remaining components of the algorithm deal with agreeing on resynchronization
points and utilizing this information in an appropriate way to ensure stabilization of the main
routine. We describe this connection to the main routine first. It is done by another, quite simple
state machine, which runs in parallel alongside the core routine. It is depicted in Figure 3.

Its purpose is to reset memory flags in a consistent way and to determine when a node is
permitted to switch to join. In general, a resynchronization point (locally observed by switching
to state resync, which is introduced later) triggers the reset of the join and sleep → waking flags.
If there are still nodes executing the basic cycle, a node may become aware of it by observing $f + 1$
nodes in state sleep → waking at some time. In this case it switches from the state passive, which
it entered at the point in time when it locally observed the resynchronization point, to the state
active, which enables an earlier transition to state join. This is expressed by the rather involved
transition rule $tr(recover, join)$: $T_6$ is much smaller than $T_7$, but $T_6$ is of no concern until the node
switches to state active and resets $T_6$.¶

It remains to explain how nodes agree on resynchronization points.

3.3 Resynchronization Algorithm



The resynchronization routine is specified in Figure 4 as well. It is a lower layer that the core routine
uses for stabilization purposes only. It provides some synchronization that is very similar to that of
a pulse, except that such "weak pulses" occur at random times, and may be generated inconsistently
after the algorithm as a whole has stabilized. Since the main routine operates independently of the
resynchronization routine once the system has stabilized, we can afford the weaker guarantees of
the routine: If it succeeds in generating a "good" resynchronization point merely once, the main
routine will stabilize deterministically.

Definition 3.1 (Resynchronization Points). Given $W \subseteq V$, time $t$ is a $W$-resynchronization point
iff each node in $W$ switches to state supp → resync in the time interval $(t, t + 2d)$.

Definition 3.2 (Good Resynchronization Points). A $W$-resynchronization point $t$ is called good if
no node from $W$ switches to state sleep during $(t - (\vartheta + 3)T_1, t)$ and no node is in state join during
$[t - T_1 - d, t + 4d)$.

¶ The condition "not in dormant" here ensures that the transition is not performed because the node has been in
state resync a long time ago, but there was no recent switching to resync.

Figure 4: Resynchronization algorithm, comprising two state machines executed in parallel at
node $i$.
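
To illustrate how Definitions 3.1 and 3.2 constrain an execution, the following Python sketch tests a candidate time $t$ against recorded events. The function names, the event representation (one switch time per node, a list of sleep switch times, a list of half-open join intervals), and the numbers are hypothetical. The window lengths are passed as parameters; in the usage example they are set to $(\vartheta + 3)T_1$, $T_1 + d$, and $4d$, which is our reading of the definition above.

```python
def is_resync_point(t, resync_switch, d):
    """Definition 3.1: every node switches to supp -> resync in (t, t + 2d)."""
    return all(t < s < t + 2 * d for s in resync_switch.values())


def is_good_resync_point(t, resync_switch, sleep_switches, join_intervals,
                         d, sleep_window, join_before, join_after):
    """Definition 3.2, with the window lengths supplied by the caller."""
    if not is_resync_point(t, resync_switch, d):
        return False
    # No node may switch to sleep during (t - sleep_window, t).
    if any(t - sleep_window < s < t for s in sleep_switches):
        return False
    # No node may be in join at any time in [t - join_before, t + join_after).
    lo, hi = t - join_before, t + join_after
    if any(start < hi and end > lo for start, end in join_intervals):
        return False
    return True


# Illustrative numbers: d = 1, theta = 1, T1 = 4 * theta * d.
d, theta = 1.0, 1.0
T1 = 4 * theta * d
windows = dict(sleep_window=(theta + 3) * T1,
               join_before=T1 + d, join_after=4 * d)
switches = {"a": 100.5, "b": 101.9}
print(is_good_resync_point(100.0, switches, [80.0], [(90.0, 94.0)], d, **windows))
print(is_good_resync_point(100.0, switches, [90.0], [(90.0, 94.0)], d, **windows))
```

In the second call a node switched to sleep at time 90, which falls into the forbidden window $(84, 100)$, so the resynchronization point is not good.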

In order to clarify that despite having a linear number of states (supp 1, ..., supp $n$), this part
of the algorithm can be implemented using 2-bit communication channels between state machines
only, we generalize our description of state machines as follows. If a state is depicted as a circle
separated into an upper and a lower part, the upper part denotes the local state, while the lower part
indicates the signal state to which it is mapped. A node's memory flags then store the respective
signal states only, i.e., remote nodes do not distinguish between states that share the same signal.
Clearly, such a machine can be simulated by a machine as introduced in the model section. The
advantage is that such a mapping can be used to reduce the number of transmitted state bits; for
the resynchronization routine given in Figure 4 we merely need two bits (init/wait and none/supp)
instead of $\lceil \log(n + 3) \rceil + 1$ bits.
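
As a small worked example of this saving, the following Python sketch computes the number of bits needed if every local state were transmitted. It assumes a binary logarithm and counts the $n + 3$ local states of the second machine as none, supp 1, ..., supp $n$, supp → resync, and resync; both the function name and this reading of the state count are our assumptions.

```python
SIGNAL_BITS = 2  # init/wait and none/supp, as stated in the text


def bits_without_mapping(n):
    """ceil(log2(n + 3)) bits for the local states plus 1 bit for init/wait."""
    local_states = n + 3
    # (m - 1).bit_length() equals ceil(log2(m)) for integers m >= 1.
    return (local_states - 1).bit_length() + 1


for n in (4, 13, 100):
    print(n, bits_without_mapping(n), SIGNAL_BITS)
```

The unmapped encoding grows logarithmically in $n$, while the signal mapping keeps the channel width constant.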

The basic idea behind the resynchronization algorithm is the following: Every now and then,
nodes will try to initiate agreement on a resynchronization point. This is the purpose of the small
state machine on the left in Figure 4. Recalling that the transition condition "true" simply means
that the node switches to state wait again as soon as it observes itself in state init, it is easy to see
that it does nothing else than creating an init signal as soon as $R_3$ expires and resetting $R_3$ again
as quickly as possible. As the time when a node switches to init is determined by the randomized
timeout $R_3$ distributed over a large interval (cf. Equality (11)) only, it is impossible to predict when
it will expire, even with full knowledge of the execution up to the current point in time. Note that
the complete independence of this part of node $i$'s state from the remaining protocol implies that
faulty nodes are not able to influence the respective times by any means.



Consider now the state machine displayed on the right of Figure 4. To understand how the
routine is intended to work, assume that at the time $t$ when a non-faulty node $i$ switches to state
init, all non-faulty nodes are not in any of the states supp → resync, resync, or supp $i$, and at
all non-faulty nodes the timeout $(R_2, \text{supp } i)$ has expired. Then, no matter what the signals from
faulty nodes or on faulty channels are, all non-faulty nodes will be in one of the states supp $j$,
$j \in V$, or supp → resync at time $t + d$. Hence, they will observe each other (and themselves) in one
of these states at some time smaller than $t + 2d$. These statements follow from the various timeout
conditions of at least $2\vartheta d$ and the fact that observing node $i$ in state init will make nodes switch
to state supp $i$ if in none or supp $j$, $j \ne i$. Hence, all of them will switch to state supp → resync
during $(t, t + 2d)$, i.e., $t$ is a resynchronization point. Since $t$ follows a random distribution that is
independent of the remaining algorithm and, as mentioned earlier, most of the time nodes cannot
switch to state sleep and it is easy to deal with the condition on join states, there is a large
probability that $t$ is a good resynchronization point. Note that timeout $R_1$ makes sure that no
non-faulty node will switch to supp → resync again anytime soon, leaving sufficient time for the
main routine to stabilize.

The scenario we just described relies on the fact that at time $t$ no node is in state supp → resync
or state resync. We will choose $R_2 \gg R_1$, implying that $R_2 + 3d$ time after a node switched to state
init all nodes have "forgotten" about this, i.e., $(R_2, \text{supp } i)$ is expired and they switched back to
state none (unless other init signals interfered). Thus, in the absence of Byzantine faults, the above
requirement is easily achieved with a large probability by choosing $R_3$ as a uniform distribution
over some interval $[R_2 + 3d, R_2 + \Theta(nR_1)]$: Other nodes will switch to init $O(n)$ times during this
interval, each time "blocking" other nodes for at most $O(R_1)$ time. If the random choice picks
any other point in time during this interval, a resynchronization point occurs. Even if the clock
speed of the clock driving $R_3$ is manipulated in a worst-case manner (affecting the density of the
probability distribution with respect to real time by a factor of at most $\vartheta$), we can just increase
the size of the interval to account for this.

However, what happens if only some of the nodes receive an init signal due to faulty channels
or nodes? If the same holds for some of the subsequent supp signals, it might happen that only
a fraction of the nodes reaches the threshold for switching to state supp → resync, resulting in an
inconsistent reset of flags and timeouts across the system. Until the respective nodes switch to state
none again, they will not support a resynchronization point again, i.e., about $R_1$ time is "lost".
This issue is the reason for the agreement step and the timeouts $(R_2, \text{supp } j)$. In order for any node
to switch to state supp → resync, there must be at least $n - 2f \ge f + 1$ non-faulty nodes supporting
this. Hence, all of these nodes recently switched to a state supp $j$ for some $j \in V$, resetting
$(R_2, \text{supp } j)$. Until these timeouts expire, $f + 1 \in \Theta(n)$ non-faulty nodes will ignore init signals
on the respective channels. Since there are $O(n^2)$ channels, it is possible to choose $R_2 \in O(nR_1)$
such that this may happen at most $O(n)$ times in $O(n)$ time. Playing with constants, we can pick
$R_3 \in O(n)$ maintaining that still a constant fraction of the times are "good" in the sense that $R_3$
expiring at a non-faulty node will result in a good resynchronization point.

3.4 Timeout Constraints 



Condition 3.3 summarizes the constraints we require on the timeouts for the core routine and the
resynchronization algorithm to act and interact as intended.

Condition 3.3 (Timeout Constraints). Define

$$\lambda := \sqrt{\frac{25\vartheta - 9}{25\vartheta}} \in \left(\frac{4}{5}, 1\right),$$

Ag := (t9 + 3)Ti, := Ta/i? - 2Ti - d, ds := 2Ti + 3d, and 5^ :=(?? + 2 - l/'d)Ti + M. The 
timeouts need to satisfy the constraints 

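$$T_1 \ge 4\vartheta d \tag{2}$$

$$T_2 \ge \vartheta \max\left\{T_1 + \Delta_g - (4\vartheta^2 + 16\vartheta + 5)d,\ \left(2\vartheta + 2 - \frac{1}{\vartheta}\right)T_1 + T_5\right\} \tag{3}$$

$$T_3 \ge \max\left\{(\vartheta - 1)T_2 + \vartheta(2T_1 + (2\vartheta + 4)d),\ (2\vartheta^2 + 3\vartheta - 1)T_1 - T_2 + \vartheta(T_6 + 5d)\right\} \tag{4}$$

$$T_4 \ge T_3 \tag{5}$$

$$T_5 \ge \max\left\{\vartheta(T_4 + 7d) - T_3 + (\vartheta - 1)T_2,\ (\vartheta^2 + \vartheta - 2)T_1 + \vartheta(T_2 + T_4 + 9d) - T_6\right\} \tag{6}$$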

n > ^ (ds - (i-^]ti+T2 + 2d] > ^As (7) 



Ri > i3max\TT + {M + 8)d,{2i3 + 4:- - ]Ti + 2T4 + T5- As- Ag + 17d} (9) 



T7 > i?(T2 + T4 + r5 + A, + 5, - Ag + (i) + r6-4d (8) 

3^ 

> 2^(i?i + + 2)ri + T2/^ + (8^ + 9)d)in - f) ^^^^ 

1 — A 

$$R_3 = \text{uniformly distributed random variable on } [\vartheta(R_2 + 3d),\ \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2] \tag{11}$$

$$\lambda \le \sqrt{\frac{\Delta_s - \Delta_g - \delta_s}{\Delta_s}} \tag{12}$$
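
For illustration, the randomized timeout of Equality (11) can be sampled as in the following Python sketch. It assumes $\lambda = \sqrt{(25\vartheta - 9)/(25\vartheta)}$ and that $R_3$ is uniform on $[\vartheta(R_2 + 3d), \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2]$, as we read the condition above; the function names and the numeric values of $R_2$, $d$, and $\vartheta$ are hypothetical.

```python
import math
import random


def lam(theta):
    """lambda as a function of the drift bound theta (assumed formula)."""
    return math.sqrt((25 * theta - 9) / (25 * theta))


def r3_interval(r2, d, theta):
    """Support of the randomized timeout R3 according to Equality (11)."""
    lo = theta * (r2 + 3 * d)
    return lo, lo + 8 * (1 - lam(theta)) * r2


def sample_r3(r2, d, theta, rng=random):
    """Draw one duration for R3 uniformly from its support."""
    lo, hi = r3_interval(r2, d, theta)
    return rng.uniform(lo, hi)


# Illustrative values only: R2 = 100, d = 1, theta = 1.
print(r3_interval(100.0, 1.0, 1.0))
print(sample_r3(100.0, 1.0, 1.0, random.Random(0)))
```

A hardware implementation would of course not call a software random number generator; the sketch only makes the shape of the distribution explicit.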

We need to show for which values of $\vartheta$ this system can be solved. Furthermore, we would like
to allow for the largest possible drift of DARTS clocks, which requires maximizing the ratio
$(T_2 + T_4)/(\vartheta(T_2 + T_3 + 4d))$, that is, the minimal gap between pulses provided that the states of the
DARTS signals are zero divided by the maximal time it takes nodes to observe themselves in state
ready with $T_3$ expired after a pulse (as then they will respond to $\mathrm{DARTS}_i$ switching to one).

Lemma 3.4. Define $\vartheta_{\max} \approx 1.247$ as the positive solution of $2\vartheta + 1 = \vartheta^2 + \vartheta^3$. Given that $\vartheta < \vartheta_{\max}$,
Condition 3.3 can be satisfied with $T_1, \ldots, T_7, R_1 \in O(1)$ and $R_2 \in O(n)$. The ratio

$$\frac{(T_2 + T_4)/\vartheta}{T_2 + T_3 + 4d}$$

can be made larger than any constant smaller than

$$\frac{\vartheta^3 + 2\vartheta + 1}{2\vartheta^4 + \vartheta^3}.$$

Proof. First, we identify several redundant inequalities in the system. We have that 
2^ + 2-^^Ti + n 5 3m + T2 + r4 - Te 

> Ti + Ag- (4??2 + i6t? + 5)d, 
i.e., the left term in the maximum in Inequality (3) is redundant. The same holds true for the left
terms in the maxima in Inequality (4) and Inequality (6), since


(2i?2 + 3^9 _ i)ri -T2 + + 5(i) 



3^Ti + (i? - l)T2 + id 

(i9 - 1)T2 + ^2Ti + {2'd + 4)d) 



and 



7?(r4 + 7d) - Ts + - l)r2 V ^{T2 + Ti-n + 7d) 

< + ^? _ 2)Ti + i){T2 + Ta + 9d) - Te. 



Finally, we can eliminate the right term in the maximum in Inequality (9) from the system, as 

T7 + {M + 8)d f T2 + n + T5 + T6 + 26s- A^ + 13d 

3 

3\ 



2?? + 4 
2t? + 4 



Ti+Ti + 2T5 + Te - Ag + I7d 
T1+T2 + 2Ti + T5 - A3 + nd. 



Next, it is not difficult to see that the right hand sides of all inequalities are strictly increasing
in $T_1$ (except for Inequality (12), whose right hand side decreases with $T_1$), implying that w.l.o.g.
we may set $T_1 := 4\vartheta d$. Similarly, we demand that Inequality (8), Inequality (9), and Inequality (10)
are satisfied with equality, i.e.,



Ri 



^T2 + T4 + Ts) +Tg- (4??2 + 4)d 
7?T7 + (M^ + 8^)d 

2{}{Ri + Ta/i? + (4t?2 + 16^? + 9)d)(n - /) 



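$$R_3 = \text{uniformly distributed random variable on } [\vartheta(R_2 + 3d),\ \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2].$$

We set $T_4 := \alpha T_3$ for a parameter

$$\alpha \in \left[1, \frac{2\vartheta + 1}{\vartheta^2 + \vartheta^3}\right),$$

implying that Inequality (5) holds by definition. The remaining simpler system is as follows.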



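$$T_2 \ge (8\vartheta^3 + 8\vartheta^2 - 4\vartheta)d + \vartheta T_5 \tag{13}$$

$$T_3 \ge (8\vartheta^3 + 12\vartheta^2 + \vartheta)d - T_2 + \vartheta T_6 \tag{14}$$

$$T_5 \ge (4\vartheta^3 + 4\vartheta^2 + \vartheta)d + \vartheta T_2 + \alpha\vartheta T_3 - T_6 \tag{15}$$

$$T_6 \ge (4\vartheta^2 + 6\vartheta - 4)d + T_2 \tag{16}$$

$$\sqrt{\frac{25\vartheta - 9}{25\vartheta}} \le \sqrt{\frac{T_2/\vartheta - (4\vartheta^2 + 28\vartheta + 4)d}{T_2/\vartheta - (8\vartheta + 1)d}}.$$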





Note the above equalities do not affect this system and can be resolved iteratively once the other
variables are fixed. We observe that the right hand side of Inequality (13) is increasing in $T_5$, the
right hand side of Inequality (15) is increasing in $T_3$, and neither $T_3$ nor $T_5$ are present in any
further inequalities. Hence, we rule that Inequality (14) and Inequality (15) shall be satisfied with
equality, i.e.,



T3 



(8^?^ + 1219^ + -T2 + ^Tq 



and arrive at the subsystem 



T2 

n 
T2 



> 



1 + i9 - i92 



a 



(17) 



> {A^^ + 6?9 - 4)d + T2 
(4^93 + 20^92 + 3?9)(i 



> 



1 - v^(25t9-9)/(25i9)^ 



where we used that $1 + \vartheta - \vartheta^2\alpha > 0$. Now we can see that Inequality (17) is also increasing in $T_6$,
set

$$T_6 := (4\vartheta^2 + 6\vartheta - 4)d + T_2,$$

and obtain



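$$T_2 \ge \frac{\left(\alpha(12\vartheta^5 + 18\vartheta^4 - 3\vartheta^3) + (4\vartheta^4 + 8\vartheta^3 + 3\vartheta^2)\right)d}{1 + 2\vartheta - (\vartheta^3 + \vartheta^2)\alpha} \tag{18}$$

$$T_2 \ge \frac{25\left(1 + \sqrt{(25\vartheta - 9)/(25\vartheta)}\right)(4\vartheta^4 + 20\vartheta^3 + 3\vartheta^2)d}{9} \tag{19}$$

exploiting that $1 + 2\vartheta - (\vartheta^3 + \vartheta^2)\alpha > 0$.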

Since $\alpha$ and thus $\vartheta$ are constantly bounded (and we treat $d$ as constant as well), we have a
feasible solution for $T_2 \in O(1)$ (considering asymptotics with respect to $n$). Resolving the equalities
we derived for the other variables, we see that $T_1, \ldots, T_7, R_1 \in O(1)$ and $R_2 \in O(n)$ as claimed.

It remains to determine the maximal ratio $(T_2 + T_4)/(\vartheta(T_2 + T_3 + 4d)) = (T_2 + \alpha T_3)/(\vartheta(T_2 +
T_3 + 4d))$ we can ensure. Obviously, for any value of $\alpha$, fixing $T_3$ implies that we want to minimize
$T_2$, and fixing $T_2$ implies that we want to maximize $T_3$. Have a look at Inequalities (13)-(16) again. The
solution we constructed minimized $T_3$ and subsequently $T_2$, parametrized by feasible values of $\alpha$.
Increase now $T_3$ by $x \in \mathbb{R}^+$ in Inequality (14). Consequently, we may increase $T_6$ in Inequality (16)
by $x/\vartheta$ compared to our previous solution (where we satisfied all inequalities with equality). Hence, we need
to increase $T_5$ by $(\vartheta\alpha - 1/\vartheta)x$ according to Inequality (15) and finally $T_2$ by $\vartheta(\vartheta\alpha - 1/\vartheta)x$. Thus,
for any feasible $\alpha$ and any $\varepsilon > 0$, we can achieve that $T_2 \le (\vartheta^2\alpha - 1 + \varepsilon)T_3$ if we just choose $x$ large
enough. We conclude that we can get arbitrarily close to the ratio

$$\frac{(\alpha + (\vartheta^2\alpha - 1))T_3}{\vartheta(1 + (\vartheta^2\alpha - 1))T_3} = \frac{\vartheta^2\alpha + \alpha - 1}{\vartheta^3\alpha}.$$

Inserting the supremum of admissible values for $\alpha$, this expression becomes

$$\frac{(2\vartheta + 1)(\vartheta^2 + 1) - (\vartheta^3 + \vartheta^2)}{\vartheta^3(2\vartheta + 1)} = \frac{\vartheta^3 + 2\vartheta + 1}{2\vartheta^4 + \vartheta^3}.$$

This shows the last claim of the lemma, concluding the proof. □
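
The constants appearing in Lemma 3.4 are easy to check numerically. The following Python sketch computes $\vartheta_{\max}$ as the positive root of $\vartheta^3 + \vartheta^2 - 2\vartheta - 1$ by bisection and compares the two closed forms of the ratio from the end of the proof; it assumes that the supremum of admissible $\alpha$ is $(2\vartheta + 1)/(\vartheta^3 + \vartheta^2)$, and all function names are our own.

```python
def theta_max(tol=1e-12):
    """Positive root of theta^3 + theta^2 - 2*theta - 1 by bisection."""
    def g(theta):
        return theta**3 + theta**2 - 2 * theta - 1

    lo, hi = 1.0, 2.0  # g(1) = -1 < 0 and g(2) = 7 > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def alpha_sup(theta):
    """Assumed supremum of admissible values for alpha."""
    return (2 * theta + 1) / (theta**3 + theta**2)


def ratio_for_alpha(theta, alpha):
    """Limit ratio (theta^2*alpha + alpha - 1) / (theta^3*alpha)."""
    return (theta**2 * alpha + alpha - 1) / (theta**3 * alpha)


def ratio_bound(theta):
    """Closed form (theta^3 + 2*theta + 1) / (2*theta^4 + theta^3)."""
    return (theta**3 + 2 * theta + 1) / (2 * theta**4 + theta**3)


print(theta_max())
for theta in (1.0, 1.1, 1.2):
    print(theta, ratio_for_alpha(theta, alpha_sup(theta)), ratio_bound(theta))
```

For every $\vartheta$ in the loop both expressions coincide, and the computed root agrees with the value $\vartheta_{\max} \approx 1.247$ stated in the lemma.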
4 Analysis

In this section we derive skew bounds $\Sigma$, as well as accuracy bounds $T^-$, $T^+$, such that the presented
protocol is a $(W, E)$-stabilizing pulse synchronization protocol, for proper choices of the set of nodes
$W$ and the set of channels $E$, with skew $\Sigma$ and accuracy bounds $T^-$, $T^+$ that stabilizes within time
$T(k) \in O(kn)$ with probability $1 - 1/2^{k(n-f)}$, for any $k \in \mathbb{N}$.

To start our analysis, we need to define the basic requirements for stabilization. Essentially, we 
need that a majority of nodes is non-faulty and the channels between them are correct. However, 
the first part of the stabilization process is simply that nodes "forget" about past events that are 
captured by their timeouts. Therefore, we demand that these nodes indeed have been non-faulty 
for a time period that is sufficiently large to ensure that all timeouts have been reset at least once 
after the considered set of nodes became non-faulty. 

Definition 4.1 (Coherent States). The subset of nodes $W \subseteq V$ is called coherent during the time interval $[t^-, t^+]$, iff during $[t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2) - d, t^+]$ all nodes $i \in W$ are non-faulty, and all channels $S_{i,j}$, $i, j \in W$, are correct.
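To make the look-back period in Definition 4.1 concrete, the sketch below computes the time from which the nodes in $W$ must have been non-faulty. It is a minimal illustration: the function names and the numeric values are hypothetical and not taken from the paper; only the formula itself comes from the definition.

```python
from fractions import Fraction


def max_timeout(theta, R2, d, lam):
    """Largest value a (randomized) timeout can take: theta*(R2 + 3d) + 8*(1 - lam)*R2."""
    return theta * (R2 + 3 * d) + 8 * (1 - lam) * R2


def coherence_lookback_start(t_minus, theta, R2, d, lam):
    """Start of the interval on which W must be fault-free for coherence on [t_minus, t_plus]."""
    return t_minus - max_timeout(theta, R2, d, lam) - d


if __name__ == "__main__":
    # Illustrative values only; they are not parameters proposed by the paper.
    start = coherence_lookback_start(
        t_minus=1000, theta=Fraction(1), R2=100, d=1, lam=Fraction(4, 5)
    )
    print(start)  # 1000 - (103 + 160) - 1 = 736
```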

We will show that if a coherent set of at least $n - f$ nodes fires a pulse, i.e., switches to accept in tight synchrony, this set will generate pulses deterministically and with controlled frequency, as long as the set remains coherent. This motivates the following definitions.

Definition 4.2 (Stabilization Points). We call time $t$ a $W$-stabilization point (quasi-stabilization point) iff all nodes $i \in W$ switch to accept during $[t, t + 2d)$ ($[t, t + 3d)$).
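Definition 4.2 translates directly into a predicate on the times at which nodes switch to accept. The following sketch assumes a hypothetical representation in which `accept_times` maps each node of $W$ to the single time at which it switches to accept; this data layout is an assumption made for illustration.

```python
def is_stabilization_point(t, accept_times, d, quasi=False):
    """Return True iff every recorded switch to accept lies in [t, t + 2d),
    or in [t, t + 3d) when checking for a quasi-stabilization point."""
    width = 3 * d if quasi else 2 * d
    return all(t <= s < t + width for s in accept_times.values())


if __name__ == "__main__":
    switches = {"a": 0.0, "b": 1.5, "c": 2.5}  # hypothetical switching times
    print(is_stabilization_point(0.0, switches, d=1.0))              # False: "c" is outside [0, 2)
    print(is_stabilization_point(0.0, switches, d=1.0, quasi=True))  # True: all within [0, 3)
```

Every stabilization point is also a quasi-stabilization point, since $[t, t + 2d) \subset [t, t + 3d)$.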

Throughout this section, we assume the set of coherent nodes $W$ with $|W| \geq n - f$ to be fixed and consider all nodes in and channels originating from $V \setminus W$ as (potentially) faulty. As all our statements refer to nodes in $W$, we will typically omit the word "non-faulty" when referring to the behaviour or states of nodes in $W$, and "all nodes" is short for "all nodes in $W$". Note, however, that we will still clearly distinguish between channels originating at faulty and non-faulty nodes, respectively, to nodes in $W$.

As a first step, we observe that at times when W is coherent, indeed all nodes reset their 
timeouts, basing the respective state transition on proper perception of nodes in W. 

Lemma 4.3. If the system is coherent during the time interval $[t^-, t^+]$, any (randomized) timeout $(T, s)$ of any node $i \in W$ expiring at a time $t \in [t^-, t^+]$ has been reset at least once since time $t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2)$. If $t'$ denotes the time when such a reset occurred, for any $j \in W$ it holds that $S_{i,j}(t') = S_j(\tau_{j,i}^{-1}(t'))$, i.e., at time $t'$, $i$ observes $j$ in a state $j$ attained when it was non-faulty.

Proof. According to Condition 3.3, the largest possible value of any (randomized) timeout is $\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2$. Hence, any timeout that is in state 1 at a time smaller than $t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2)$ expires before time $t^-$ or is reset at least once. As by the definition of coherency all nodes in $W$ are non-faulty and all channels between such nodes are correct during $[t^- - (\vartheta(R_2 + 3d) + 8(1 - \lambda)R_2) - d, t^+]$, this implies the statement of the lemma. □

Phrased informally, any corruption of timeout and channel states eventually ceases, as correct timeouts expire and correct links remember no events that lie $d$ or more time in the past. Proper cleaning of the memory flags is more complicated and will be explained further down the road.

Throughout this section, we will assume for the sake of simplicity that the system is coherent at all times and use this lemma implicitly, e.g., we will always assume that nodes from $W$ will observe all other nodes from $W$ in states that they indeed had less than $d$ time ago, that the expiration of randomized timeouts at non-faulty nodes cannot be predicted accurately, etc. We will discuss more general settings in Section 5.

We proceed by showing that once all nodes in $W$ switch to accept in a short period of time, i.e., a $W$-quasi-stabilization point is reached, the algorithm guarantees that synchronized pulses are generated deterministically with a frequency that is bounded both from above and below.

Theorem 4.4. Suppose $t$ is a $W$-quasi-stabilization point. Then

(i) all nodes in $W$ switch to accept exactly once within $[t, t + 3d)$, and do not leave accept until $t + 4d$, and

(ii) there will be a $W$-stabilization point $t' \in (t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 5d)$ satisfying that no node in $W$ switches to accept in the time interval $[t + 3d, t')$ and that

(iii) each node $i$'s, $i \in W$, core state machine (Figure 1) is metastability-free during $[t + 4d, t' + 4d)$.



Proof. Proof of (i): Due to Inequality (2), a node does not leave the state accept earlier than $T_1/\vartheta \geq 4d$ time after switching to it. Thus, no node can switch to accept twice during $[t, t + 3d)$. By definition of a quasi-stabilization point, every node does switch to accept in the interval $[t, t + 3d) \subset [t, t + T_1/\vartheta)$. This proves Statement (i).

Proof of (ii): For each $i \in W$, let $t_i \in [t, t + 3d)$ be the time when $i$ switches to accept. By (i), $t_i$ is well-defined. Further let $t'_i$ be the infimum of times in $(t_i, \infty)$ when $i$ switches to recover, join, or propose. In the following, denote by $i \in W$ a node with minimal $t'_i$.



We will show that all nodes switch to propose via states sleep, sleep $\rightarrow$ waking, waking, and ready in the presented order. By (i) nodes do not leave accept before $t + 4d$. Thus at time $t + 4d$, each node in $W$ is in state accept and observes each other node in $W$ in accept. Hence, each node in $W$ memorizes each other node in $W$ in accept at time $t + 4d$. For each node $j \in W$, let $t_{j,s}$ be the time node $j$'s timeout $T_1$ expires first after $t_j$. Then $t_{j,s} \in (t_j + T_1/\vartheta, t_j + T_1 + d)$. Since $|W| \geq n - f$, each node $j$ switches to state sleep at time $t_{j,s}$. Hence, by time $t + T_1 + 4d$, no node will be observed in state accept anymore (until the time when it switches to accept again).

When a node $j \in W$ switches to state waking at the minimal time $t_w$ larger than $t_j$, it does not do so earlier than at time $t + T_1/\vartheta + (1 + 1/\vartheta)T_1 = t + (1 + 2/\vartheta)T_1 > t + T_1 + 5d$. This implies that all nodes in $W$ have already left accept at least $d$ time ago, since they switched to sleep at their respective times $t_{j,s} < t + T_1 + 4d$. Moreover, they cannot switch to accept again until $t'_i$ as it is minimal and nodes need to switch to propose before switching to accept. Hence, nodes in $W$ are not observed in state accept during $(t + T_1 + 5d, t'_i]$, in particular not by node $j$. Furthermore, nodes in $W$ are not observed in state recover during $(t_w - d, t'_i]$. As it resets its accept and recover flags upon switching to waking, $j$ will hence neither switch from waking to recover nor from trust to suspect during this period, and thus also not from ready to recover.

Now consider node $i$. By the previous observation, it will not switch from waking to recover, but to ready, following the basic cycle. Consequently, it must wait for timeout $T_2$ to expire, i.e., it cannot switch to ready earlier than at time $t + T_2/\vartheta$. As nodes in $W$ clear their join flags upon switching to state ready, by definition of $t'_i$ node $i$ cannot switch from ready to join, but has to switch to propose. Again, by definition of $t'_i$, it cannot do so before timeouts $T_3$ or $T_4$ expire, i.e., before time

\[ t + \frac{T_2}{\vartheta} + \frac{\min\{T_3, T_4\}}{\vartheta} = t + \frac{T_2 + T_3}{\vartheta} \geq t + T_2 + 5d. \tag{20} \]

Footnote on the definition of $t'_i$: Note that we follow the convention that $\inf \emptyset = \infty$ if the infimum is taken with respect to a (from above) unbounded subset of $\mathbb{R}$.

Footnote on the bound for $t_{j,s}$: The upper bound comprises an additive term of $d$ since $T_1$ is reset at some time from $(t_j, t_j + d)$.

All other nodes in $W$ will switch to waking, and for the first time after $t_j$, observe themselves in state waking at a time within $(t + T_1 + 4d, t + T_1(2 + \vartheta) + 7d)$. Recall that unless they memorize at least $f + 1$ nodes in accept or recover while being in state waking, they will all switch to state ready by time

\[ \max\{t + T_2 + 4d, t + (\vartheta + 2)T_1 + 7d\} = t + T_2 + 4d. \tag{21} \]

As we just showed that $t'_i > t + T_2 + 5d$, this implies that at time $t + T_2 + 5d$ all nodes are observed in state ready, and none of them leaves before time $t'_i$.

Now choose $t'$ to be the infimum of times from $(t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d]$ when a node in $W$ switches to state accept. Because of Inequality (20), this is the first time when a node $j \in W$ may switch to accept again after its respective time $t_j$. We will next show that no node $j \in W$ can switch to recover within $[t_j, t' + 2d]$. Since at time $t'_i$ node $j$ does not memorize other nodes from $W$ in state accept, it will also not do so before time $t'$. Hence, it cannot switch from ready to recover during $[t'_i, t' + 2d]$ since it cannot be in state suspect during that period. By Inequality (20), $j$ cannot switch to propose within $[t_j, t + (T_2 + T_3)/\vartheta)$, and thus its timeout $T_5$ cannot expire until time

\[ t + \frac{T_2 + T_3 + T_5}{\vartheta} \geq t + T_2 + T_4 + 7d \geq t' + 3d, \tag{22} \]

making it impossible for $j$ to switch from propose to recover at a time within $[t_j, t' + 3d]$. What is more, a node from $W$ that switches to accept must stay there for at least $T_1/\vartheta > 3d$ time. Thus, by definition of $t'$, no node $j \in W$ can switch from accept to recover at a time within $[t_j, t' + 3d]$. Hence, no node $j \in W$ can switch to state recover after $t_j$, but earlier than time $t' + 2d$. As nodes reset their join flags upon switching to state ready, it follows that no node in $W$ can switch to other states than propose or accept during $[t + T_2 + 4d, t' + 2d]$. In particular, no node in $W$ resets its propose flags during $[t + T_2 + 5d, t' + 2d] \supseteq [t'_i, t' + 2d]$.

If at time $t'$ a node in $W$ switches to state accept, $n - 2f \geq f + 1$ of its propose flags corresponding to nodes in $W$ are true, i.e., in state 1. As the node reset its propose flags at the most recent time when it switched to ready and no nodes from $W$ have been observed in propose between this time and $t'_i$, it holds that $f + 1$ nodes in $W$ switched to state propose during $[t'_i, t')$. Since we established that no node resets its propose flags during $[t'_i, t' + 2d]$, it follows that all nodes are in state propose by time $t' + d$. Consequently, all nodes in $W$ will observe all nodes in $W$ in state propose before time $t' + 2d$ and switch to accept, i.e., $t' \in (t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d)$ is a stabilization point. Statement (ii) follows.

On the other hand, if at time $t'$ no node in $W$ switches to state accept, it follows that $t' = t + T_2 + T_4 + 4d$. As all nodes observe themselves in state ready by time $t + T_2 + 5d$, they switch to propose before time $t + T_2 + T_4 + 5d = t' + d$ because $T_4$ expired. By the same reasoning as in the previous case, they switch to accept before time $t' + 2d$, i.e., Statement (ii) holds as well.



Footnote on the choice of $t'$: Note that since we take the infimum on $(t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d]$, we have that $t' \leq t + T_2 + T_4 + 4d$.

Proof of (iii): We have shown that within $[t_j, t' + 2d]$, any node $j \in W$ switches to states along the basic cycle only. Moreover, such nodes switch to accept at some time in $[t', t' + 2d]$. Since $T_1 \geq 4\vartheta d$, this implies that no node observing itself in accept after time $t'$ will leave this state before time $t' + 4d$. To show the correctness of Statement (iii), it is thus sufficient to prove that, whenever $j$ switches from state $s$ of the basic cycle to $s'$ of the basic cycle during time $[t_j + d, t' + 2d] \supseteq [t + 4d, t' + 2d]$, the transition from $s$ to join or recover is disabled from the time it switches to $s'$ until it observes itself in this state. We consider transitions tr(accept, recover), tr(waking, recover), tr(ready, recover), tr(ready, join), and tr(propose, recover) one after the other:

1. tr(accept, recover): We showed that node $j$'s tr(accept, sleep) is satisfied before time $t + 4d \leq t + T_1/\vartheta$, i.e., before tr(accept, recover) can hold, and no node resets its accept flags less than $d$ time after switching to state sleep. When $j$ switches to state accept again at or after time $t'$, $T_1$ will not expire earlier than time $t' + 4d$.

2. tr(waking, recover): As part of the reasoning in (ii), we derived that tr(waking, recover) does not hold at nodes from $W$ observing themselves in state waking.

3. tr(ready, recover) and tr(ready, join): Similarly, we proved that at no node in $W$, condition tr(ready, recover) or tr(ready, join) can hold during $(t + (T_2 + T_3)/\vartheta, t' + 2d)$, and nodes in $W$ are in state ready during $(t + (T_2 + T_3)/\vartheta, t' + d)$ only.

4. tr(propose, recover): Finally, the additional slack of $d$ in Inequality (22) ensures that $T_5$ does not expire at any node in $W$ switching to state accept during $(t', t' + 2d)$ earlier than time $t' + 3d$.

Since $[t_j, t' + 4d) \supseteq [t + 3d, t' + 4d)$, Statement (iii) follows. □
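The timing guarantee of Statement (ii) can be turned into a small helper that reports when the next pulse may begin. This is a sketch under assumptions: the function name is hypothetical, the example values are arbitrary and not checked against the paper's timeout constraints, and the bounds used are the ones appearing in the proof, namely that $t'$ is taken from $(t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 4d]$.

```python
from fractions import Fraction


def next_pulse_window(t, T2, T3, T4, d, theta):
    """Return (lower, upper) such that the next stabilization point t' satisfies
    lower < t' <= upper, given a quasi-stabilization point at time t."""
    lower = t + (T2 + T3) / theta   # fastest case: clocks run at rate theta
    upper = t + T2 + T4 + 4 * d     # slowest case: T4 must expire, plus delays
    return lower, upper


if __name__ == "__main__":
    # Arbitrary illustrative values, not a parameter set from the paper.
    lo, hi = next_pulse_window(0, T2=100, T3=20, T4=30, d=1, theta=Fraction(11, 10))
    print(lo, hi)  # 1200/11 and 134
```

The ratio between the two window ends is what bounds the frequency of pulses from above and below.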



Inductive application of Theorem 4.4 shows that by construction of our algorithm, nodes in $W$ provably do not suffer from metastability upsets once a $W$-quasi-stabilization point is reached, as long as all nodes in $W$ remain non-faulty and the channels connecting them correct. Unfortunately, it can be shown that it is impossible to ensure this property during the stabilization period, thus rendering a formal treatment infeasible. This is not a peculiarity of our system model, but a threat to any model that allows for the possibility of metastable upsets as encountered in physical chip designs. However, it was shown that, by proper chip design, the probability of metastable upsets can be made arbitrarily small [13]. In the remainder of this work, we will therefore assume that all non-faulty nodes are metastability-free in all executions.

The next lemma reveals a very basic property of the main algorithm that is satisfied if no nodes may switch to state join in a given period of time. It states that in order for any non-faulty node to switch to state sleep, there need to be $f + 1$ non-faulty nodes supporting this by switching to state accept. Subsequently, these nodes cannot do so again for a certain time window. In particular, this implies that during the respective time window no node may switch to sleep.

Lemma 4.5. Assume that at time $t_s$, some node from $W$ switches to sleep and no node from $W$ is in state join during $[t_s - T_1 - d, t^+]$. Then there is a subset $A$ of at least $n - 2f$ nodes such that

(i) each node from $A$ has been in state accept at some time in the interval $(t_s - T_1 - d, t_s)$ and

(ii) no node from $A$ is in state propose or switches to state accept during the time interval $(t_s, \min\{t_s + \Delta_s, t^+\})$.

Proof. In order to switch to sleep at time $t_s$, a node must have observed $n - 2f$ non-faulty nodes in state accept at times from $(t_s - T_1, t_s]$, since it resets its accept flags at the time $t_a > t_s - T_1$ (that is minimal with this property) when it switched to state accept. Each of these nodes must have been in state accept at some time from $(t_s - T_1 - d, t_s)$, showing the existence of a set $A \subseteq W$ satisfying Statement (i).

We will next prove Statement (ii). Consider a node $i \in A$. In order to switch to propose or again to accept, $i$ must switch to join first or wait for $T_2$ to expire after switching to state accept some time after $t_s - 2T_1 - d$. However, by assumption the first option is impossible until time $t^+$, since no nodes are in state join during $[t_s - T_1 - d, t^+]$. Therefore, $i$ will not be in state propose or switch to state accept again until $t_s - 2T_1 + T_2/\vartheta - d = t_s + \Delta_s$ or $t^+$, respectively, whatever is smaller. This proves Statement (ii). □

Granted that nodes are not in state join, this implies that the time windows during which nodes may switch to sleep and sleep $\rightarrow$ waking, respectively, are well-separated.

Corollary 4.6. Assume that during $[t^- - T_1 - d, t^+]$ no node from $W$ is in state join, where $t^+ - t^- < \Delta_s$. Then

(i) any time interval $[t_a, t_b] \subseteq [t^-, t^+]$ of minimum length containing all switches of nodes in $W$ from accept to sleep during $[t^-, t^+]$ has length at most $2T_1 + 3d$, and

(ii) granted that no node from $W$ switches to state sleep during $(t^- - (\vartheta + 1)T_1 - d, t^-)$, any time interval $[t_a, t_b] \subseteq [t^-, t^+ + (1 + 1/\vartheta)T_1]$ of minimum length containing all times in $[t^-, t^+ + (1 + 1/\vartheta)T_1]$ when a node from $W$ switches to sleep $\rightarrow$ waking has length at most $\delta_s$.

Proof. Consider Statement (i) first. If there is no node from $W$ that switches from accept to sleep during $[t^-, t^+]$, the statement is trivially satisfied.

Otherwise, choose any such interval $[t_a, t_b]$. Since $[t_a, t_b] \neq \emptyset$ is minimal, both at time $t_a$ and $t_b$, some nodes from $W$ switch to sleep. Assume by means of contradiction that $t_b - t_a > 2T_1 + 3d$. Due to the constraints on $t^-$ and $t^+$, we have that $t_b < t_a + \Delta_s$. Moreover, during $[t_a - T_1 - d, t_b] \subseteq [t^- - T_1 - d, t^+]$ no node from $W$ is in state join. Thus, we can apply Lemma 4.5 to $t_a$ and see that at least $n - 2f \geq f + 1$ nodes from $W$ do not switch to accept in the time interval

\[ (t_a, t_a + \Delta_s) \supseteq (t_b - (2T_1 + 3d), t_b]. \]

As nodes from $W$ leave state accept as soon as $T_1$ expires, these nodes are not in state accept during $[t_b - (T_1 + 2d), t_b]$, implying that they are not observed in this state during $[t_b - (T_1 + d), t_b]$. It follows that no node in $W$ can observe more than $n - f - 1$ different nodes in state accept during $[t_b - (T_1 + d), t_b]$. As nodes from $W$ clear their accept flags upon switching to accept and leave state accept after less than $T_1 + d$ time, we conclude that no node from $W$ switches to state sleep at time $t_b$. This is a contradiction, implying that the assumption that $t_b - t_a > 2T_1 + 3d$ must be wrong and therefore Statement (i) must be true.

To obtain Statement (ii), observe first that any node from $W$ switching to state sleep at some time $t \leq t^- - (\vartheta + 1)T_1 - d$ switches to state sleep $\rightarrow$ waking before time $t^-$. Subsequently, it needs to switch to state sleep again in order to be in state sleep $\rightarrow$ waking at or later than time $t^-$. On the other hand, every node that switches to sleep after time $t^+$ will not switch to sleep $\rightarrow$ waking again before time $t^+ + (1 + 1/\vartheta)T_1$. Hence, any node switching to state sleep $\rightarrow$ waking during the considered interval must switch to sleep during $[t^-, t^+]$. Applying Statement (i) to $[t^-, t^+]$ yields that nodes from $W$ can only switch to sleep within a time interval of length at most $2T_1 + 3d$. Considering the fastest and slowest possible transitions from sleep to sleep $\rightarrow$ waking we obtain that nodes from $W$ can switch to sleep $\rightarrow$ waking within a time interval of length at most $2T_1 + 3d + (\vartheta + 1)T_1 + d - (1 + 1/\vartheta)T_1 = \delta_s$. Statement (ii) follows. □
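The two window lengths used in Lemma 4.5 and Corollary 4.6 are simple functions of the timeouts. The sketch below evaluates them as they appear in the proofs above, $\Delta_s = T_2/\vartheta - 2T_1 - d$ and $\delta_s = 2T_1 + 3d + (\vartheta + 1)T_1 + d - (1 + 1/\vartheta)T_1$; the function names and the sample values are hypothetical.

```python
from fractions import Fraction


def separation_window(T1, T2, d, theta):
    """Delta_s: time after a switch to sleep during which the supporting
    nodes cannot be in propose or switch to accept again."""
    return T2 / theta - 2 * T1 - d


def waking_spread(T1, d, theta):
    """delta_s: maximal spread of switches to 'sleep -> waking', combining the
    spread of switches to sleep with the slowest and fastest transitions."""
    return 2 * T1 + 3 * d + (theta + 1) * T1 + d - (1 + 1 / theta) * T1


if __name__ == "__main__":
    theta = Fraction(11, 10)  # illustrative drift bound
    print(waking_spread(T1=10, d=1, theta=theta))              # 285/11
    print(separation_window(T1=10, T2=1000, d=1, theta=theta))  # 9769/11
```

Collecting terms shows that $\delta_s = (2 + \vartheta - 1/\vartheta)T_1 + 4d$, so for $\vartheta$ close to 1 the spread is roughly $2T_1 + 4d$.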

We are now ready to advance to proving that good resynchronization points are likely to occur within bounded time, no matter what the strategy of the Byzantine faulty nodes and channels is. To this end, we first establish that in any execution, at most times a node switching to state init will result in a good resynchronization point. This is formalized by the following definition.

Definition 4.7 (Good Times). Given an execution $\mathcal{E}$ of the system, denote by $\mathcal{E}'$ any execution satisfying that $\mathcal{E}'|_{[0,t)} = \mathcal{E}|_{[0,t)}$, where at time $t$ a node $i \in W$ switches to state init in $\mathcal{E}'$. Time $t$ is good in $\mathcal{E}$ with respect to $W$ provided that for any such $\mathcal{E}'$ it holds that $t$ is a good $W$-resynchronization point in $\mathcal{E}'$.

The previous statement thus boils down to showing that in any execution, the majority of the times is good.

Lemma 4.8. Given any execution $\mathcal{E}$ and any time interval $[t^-, t^+]$, the volume of good times in $\mathcal{E}$ during $[t^-, t^+]$ is at least

\[ \lambda^2 (t^+ - t^-) - \frac{11(1 - \lambda)R_2}{10\vartheta}. \]



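Proof. Assume w.l.o.g. that $|W| = n - f$ (otherwise consider a subset of size $n - f$) and abbreviate

\begin{align*}
N &:= \left\lceil \frac{\vartheta(t^+ - t^-) + R_2/10}{R_2} \right\rceil (n - f) \\
&\geq \left\lceil \frac{\vartheta(t^+ - t^-) + \vartheta(R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d)/(5(1 - \lambda))}{R_2} \right\rceil (n - f) \\
&\geq \left\lceil \frac{\vartheta(t^+ - t^- + R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d)}{R_2} \right\rceil (n - f) \\
&\geq \left\lceil \frac{\vartheta(t^+ - t^- + R_1 + T_1 + 4d + \Delta_g)}{R_2} \right\rceil (n - f).
\end{align*}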

The proof is in two steps: First we construct a measurable subset of $[t^-, t^+]$ that comprises good times only. In a second step a lower bound on the volume of this set is derived.

Constructing the set: Consider an arbitrary time $t \in [t^-, t^+]$, and assume a node $i \in W$ switches to state init at time $t$. When it does so, its timeout $R_3$ expires. By Lemma 4.3 all timeouts of node $i$ that expire at times within $[t^-, t^+]$ have been reset at least once until time $t^-$. Let $t_{R_3}$ be the maximum time not later than $t$ when $R_3$ was reset. Due to the distribution of $R_3$ we know that

\[ t_{R_3} \leq t - (R_2 + 3d). \]

Thus, node $i$ is not in state init during time $[t - (R_2 + 2d), t)$, and no node $j \in W$ observes $i$ in state init during time $[t - (R_2 + d), t)$. Thereby any node $j$'s, $j \in W$, timeout $(R_2, \text{supp } i)$ corresponding to node $i$ is expired at time $t$.

We claim that the condition that no node from $W$ is in or observed in one of the states resync or supp $\rightarrow$ resync at time $t$ is sufficient for $t$ being a $W$-resynchronization point. To see this, assume that the condition is satisfied. Thus all nodes $j \in W$ are in states none or supp $k$ for some $k \in \{1, \ldots, n\}$ at time $t$. By the algorithm, they all will switch to state supp $i$ or state supp $\rightarrow$ resync during $(t, t + d)$. It might happen that they subsequently switch to another state supp $k'$ for some $k' \in V$, but all of them will be in one of the states with signal supp during $(t + d, t + 2d]$. Consequently, all nodes will observe at least $n - f$ nodes in state supp during $(t', t + 2d)$ for some time $t' < t + 2d$. Hence, those nodes in $W$ that were still in state supp $i$ (or supp $k'$ for some $k'$) at time $t + d$ switch to state supp $\rightarrow$ resync before time $t + 2d$, i.e., $t$ is a $W$-resynchronization point.

We proceed by analyzing under which conditions $t$ is a good $W$-resynchronization point. Recall that in order for $t$ to be good, it has to hold that no node from $W$ switches to state sleep during $(t - \Delta_g, t)$ or is in state join during $(t - T_1 - d, t + 4d)$.

We begin by characterizing subsets of good times within $(t_r, t'_r) \subseteq [t^-, t^+]$, where $t_r$ and $t'_r$ are times such that during $(t_r, t'_r)$ no node from $W$ switches to state supp $\rightarrow$ resync. Due to timeout



Ri y {M + 2)d, 

we know that during $(t_r + R_1 + 2d, t'_r)$, no node from $W$ will be in, or be observed in, states supp $\rightarrow$ resync or resync. Thus, if a node from $W$ switches to init at a time within $(t_r + R_1 + 2d, t'_r)$, it is a $W$-resynchronization point. Further, all nodes in $W$ will be in state dormant during $(t_r + R_1 + 2d, t'_r + 4d)$. Thus all nodes in $W$ will be observed to be in state dormant during $(t_r + R_1 + 3d, t'_r + 4d)$, implying that they are not in state join during $(t_r + R_1 + 3d, t'_r + 4d)$. In particular, any time $t \in (t_r + R_1 + T_1 + 4d, t'_r)$ satisfies that no node in $W$ is in state join during $(t - T_1 - d, t + 4d)$.

Further define $t_a$ to be the infimum of times from $(t_r + R_1 + T_1 + 4d, t'_r]$ when a node from $W$ switches to state sleep. By Corollary 4.6, no node from $W$ switches to state sleep during $(t_a + \delta_s, \min\{t_a + \Delta_s, t'_r\})$. Hence, if $t_a < \infty$, all times in both $(t_r + R_1 + T_1 + 4d + \Delta_g, t_a)$ and $(t_a + \delta_s + \Delta_g, \min\{t_a + \Delta_s, t'_r\})$ are good.

In case $t_a < t'_r - \Delta_s$ we can repeat the reasoning, defining that $t'_a$ is the infimum of times from $[t_a + \Delta_s, t'_r]$ when a node switches to state sleep. By analogous arguments as before we see that all times in the sets $[t_a + \Delta_s, t'_a)$ and $(t'_a + \delta_s + \Delta_g, \min\{t'_a + \Delta_s, t'_r\})$ are good.

By induction on the times $t_a, t'_a, \ldots, t_a^{(k)}$ (halting once $t_a^{(k)} \geq t'_r - \Delta_s$), we infer that the total volume of times from $(t_r, t'_r)$ as well as from $(t_r + R_1 + T_1 + 4d + \Delta_g, t'_r)$ that is good is at least

\[ \left\lfloor \frac{t'_r - (t_r + R_1 + T_1 + 4d + \Delta_g)}{\Delta_s} \right\rfloor (\Delta_s - \Delta_g - \delta_s) \geq \frac{t'_r - (t_r + R_1 + T_1 + 4d + \Delta_g + \Delta_s)}{\Delta_s} (\Delta_s - \Delta_g - \delta_s). \tag{23} \]
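The right-hand side of Inequality (23) is easy to evaluate for a single gap between two consecutive switches to supp $\rightarrow$ resync. The sketch below is illustrative only: the function name and sample values are hypothetical, and the result is clamped at zero, which is still a valid lower bound because a volume is never negative.

```python
def good_time_volume_bound(t_r, t_r_next, R1, T1, d, Delta_g, Delta_s, delta_s):
    """Lower bound on the volume of good times in (t_r, t_r_next), following the
    right-hand side of Inequality (23); returns 0.0 when the bound is vacuous."""
    usable = t_r_next - (t_r + R1 + T1 + 4 * d + Delta_g + Delta_s)
    per_window = Delta_s - Delta_g - delta_s
    if usable <= 0 or per_window <= 0:
        return 0.0
    return usable / Delta_s * per_window


if __name__ == "__main__":
    # Arbitrary illustrative values, not a parameter set from the paper.
    bound = good_time_volume_bound(
        t_r=0, t_r_next=1000, R1=50, T1=10, d=1, Delta_g=30, Delta_s=200, delta_s=26
    )
    print(round(bound, 2))
```

The fraction $(\Delta_s - \Delta_g - \delta_s)/\Delta_s$ is the constant share of good times in a long gap; the subtracted prefix is the constant loss per gap.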



In other words, up to a constant loss in each interval $(t_r, t'_r)$, a constant fraction of the times are good.

Volume of the set: In order to infer a lower bound on the volume of good times during $[t^-, t^+]$, we subtract from $[t^-, t^+]$ all intervals $[t_r, t_r + R_1 + T_1 + 4d + \Delta_g]$, where a node from $W$ switches to supp $\rightarrow$ resync at a time $t_r$ within $[t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$. Formally define

\[ G = \bigcup_{\substack{t_r \in [t^- - (R_1 + T_1 + 4d + \Delta_g),\, t^+] \\ \exists i \in W:\ i \text{ switches to supp} \rightarrow \text{resync at } t_r}} [t_r, t_r + R_1 + T_1 + 4d + \Delta_g]. \]

What remains is the set $[t^-, t^+] \setminus G$, that has as subset the union of intervals $(t_r + R_1 + T_1 + 4d + \Delta_g, t'_r) \subseteq [t^-, t^+]$, where $t_r$ and $t'_r$ are times at which a node from $W$ switches to supp $\rightarrow$ resync and no node from $W$ switches to supp $\rightarrow$ resync within $(t_r, t'_r)$. Note that for each such interval we already know it contains a certain amount of good times because of Inequality (23). In order to lower bound the good times in $[t^-, t^+]$, it is thus feasible to lower bound the volume and number of connected components (i.e., maximal intervals) of any subset of $[t^-, t^+] \setminus G$.



Observe that any node in W does not switch to state init more than 



t- +R1+T1+M+ Ag 

R^ 



+ R1+T1+M+ Ag 

R2 + d 



< 



N 



n - f 



(24) 



times during $[t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$.

Now consider the case that a node in $W$ switches to state supp $\rightarrow$ resync at a time $t$ satisfying that no node in $W$ switched to state init during $(t - (8\vartheta + 6)d, t)$. This necessitates that this node observes $n - f$ of its channels in state supp during $(t - (2\vartheta + 1)d, t)$, at least $n - 2f \geq f + 1$ of which originate from nodes in $W$. As no node from $W$ switched to init during $(t - (8\vartheta + 6)d, t)$, every node that has not observed a node $i \in V \setminus W$ in state init at a time from $(t - (8\vartheta + 4)d, t)$ when $(R_2, \text{supp } i)$ is expired must be in a state whose signal is none during $(t - (2\vartheta + 2)d, t)$ due to timeouts. Therefore its outgoing channels are not in state supp during $[t - (2\vartheta + 1)d, t)$. By means of contradiction, it thus follows that for each node $j$ of the at least $f + 1$ nodes (which are all from $W$), there exists a node $i \in V \setminus W$ such that node $j$ resets timeout $(R_2, \text{supp } i)$ during the time interval $(t - (8\vartheta + 4)d, t)$.

The same reasoning applies to any time $t' \notin [t - (8\vartheta + 6)d, t)$ satisfying that some node in $W$ switches to state supp $\rightarrow$ resync at time $t'$ and no node in $W$ switched to state init during $(t' - (8\vartheta + 6)d, t')$. Note that the set of the respective at least $f + 1$ events (corresponding to the at least $f + 1$ nodes from $W$) where timeouts $(R_2, \text{supp } i)$ with $i \in V \setminus W$ are reset and the set of the events corresponding to $t$ are disjoint. However, the total number of events where such a timeout can be reset during $[t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$ is upper bounded by



t+ - t- + Ri + Ti+Ad + Ag 
R2M 



< (/ + m, 



(25) 



i.e., the total number of channels from nodes not in $W$ ($|V \setminus W|$ many) to nodes in $W$ multiplied by the number of times the associated timeout can expire at the receiving node in $W$ during $[t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$.

With the help of Inequalities (24) and (25), we can show that $G$ can be covered by less than $2N$ intervals of size $(R_1 + T_1 + 4d + \Delta_g) + (8\vartheta + 6)d$ each. By Inequality (24), there are no more than $N$ times $t \in [t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$ when a non-faulty node switches to init and thus may cause others to switch to state supp $\rightarrow$ resync at times in $[t, t + (8\vartheta + 6)d]$. Similarly, Inequality (25) shows that the channels from $V \setminus W$ to $W$ may cause at most $N - 1$ such times $t \in [t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$, since any such time requires the existence of at least $f + 1$ events where timeouts $(R_2, \text{supp } i)$, $i \in V \setminus W$, are reset at nodes in $W$, and the respective events are disjoint. Thus, all times $t_r \in [t^- - (R_1 + T_1 + 4d + \Delta_g), t^+]$ when some node $i \in W$ switches to supp $\rightarrow$ resync are covered by at most $2N - 1$ intervals of length $(8\vartheta + 6)d$.

This results in a cover $G' \supseteq G$ consisting of at most $2N - 1$ intervals that satisfies that

\[ \mathrm{vol}(G) \leq \mathrm{vol}(G') < 2N(R_1 + T_1 + \Delta_g + (8\vartheta + 10)d). \]

Summing over the at most $2N$ intervals that remain in $[t^-, t^+] \setminus G'$ and using Inequality (23), we conclude that the volume of good times during $[t^-, t^+]$ is at least

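\begin{align*}
& \frac{t^+ - t^- - 2N(R_1 + T_1 + (8\vartheta + 10)d + \Delta_g + \Delta_s)}{\Delta_s} (\Delta_s - \Delta_g - \delta_s) \\
\geq{}& \frac{t^+ - t^- - 2N(R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d)}{\Delta_s} (\Delta_s - \Delta_g - \delta_s) \\
\geq{}& \lambda \left( t^+ - t^- - 2\left( \frac{\vartheta(t^+ - t^-)}{R_2} + \frac{11}{10} \right)(n - f)(R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d) \right) \\
={}& \left( 1 - \frac{2\vartheta(R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d)(n - f)}{R_2} \right) \lambda (t^+ - t^-) - \frac{11\lambda(R_1 + (\vartheta + 2)T_1 + T_2/\vartheta + (8\vartheta + 9)d)(n - f)}{5} \\
\geq{}& \lambda^2 (t^+ - t^-) - \frac{11(1 - \lambda)R_2}{10\vartheta}
\end{align*}

as claimed. The lemma follows. □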

We are now in the position to prove our second main theorem, which states that a good resynchronization point occurs within $O(R_2)$ time with overwhelming probability.

Theorem 4.9. Denote by $\hat{E}_3 := \vartheta(R_2 + 3d) + 8(1 - \lambda)R_2 + d$ the maximal value the distribution $R_3$ can attain plus the at most $d$ time until $R_3$ is reset whenever it expires. For any $k \in \mathbb{N}$ and any time $t$, with probability at least $1 - (1/2)^{k(n-f)}$ there will be a good $W$-resynchronization point during $[t, t + (k + 1)\hat{E}_3]$.

Proof. Assume w.l.o.g. that $|W| = n - f$ (otherwise consider a subset of size $n - f$). Fix some node $i \in W$ and denote by $t_0$ the infimum of times from $[t, t + (k + 1)\hat{E}_3]$ when node $i$ switches to init. We have that $t_0 < t + \hat{E}_3$. By induction, it follows that node $i$ will switch to state init at least another $k$ times during $[t, t + (k + 1)\hat{E}_3]$ at the times $t_1 < t_2 < \ldots < t_k$. We claim that each such time $t_j$, $j \in \{1, \ldots, k\}$, is good, and therefore a good $W$-resynchronization point, with probability at least 1/2, independently of the other times.

We prove this by induction on $j$: As induction hypothesis, suppose for some $j \in \{1, \ldots, k - 1\}$, we showed the statement for $j' \in \{1, \ldots, j - 1\}$ and the execution of the system is fixed until time $t_{j-1}$, i.e., $\mathcal{E}|_{[0,t_{j-1}]}$ is given. Now consider the set of executions that are extensions of $\mathcal{E}|_{[0,t_{j-1}]}$ and have the same clock functions as $\mathcal{E}$. For each such execution $\mathcal{E}'$ it holds that $\mathcal{E}'|_{[0,t_{j-1}]} = \mathcal{E}|_{[0,t_{j-1}]}$, and all nodes' clocks make progress in $\mathcal{E}'$ as in $\mathcal{E}$. Clearly each such $\mathcal{E}'$ has its own time $t_j < t + (j + 1)\hat{E}_3$ when $R_3$ expires next after $t_{j-1}$ at node $i$, and $i$ switches to init. We next characterize the distribution of the times $t_j$.

As the rate of the clock driving node $i$'s $R_3$ is between 1 and $\vartheta$, $t_j > t_{j-1}$ is within an interval, call it $[t^-, t^+]$, of size at most

\[ t^+ - t^- \leq 8(1 - \lambda)R_2, \]

regardless of the progress that $i$'s clock $C$ makes in any execution $\mathcal{E}'$.

Certainly we can apply Lemma 4.8 also to each of the $\mathcal{E}'$, showing that the volume of times from $[t^-, t^+]$ that are not good in $\mathcal{E}'$ is at most

\[ (1 - \lambda^2)(t^+ - t^-) + \frac{11(1 - \lambda)R_2}{10\vartheta}. \]

Since clock $C$ can make progress not faster than at rate $\vartheta$ and the probability density of $R_3$ is constantly $1/(8(1 - \lambda)R_2)$ (with respect to the clock function $C$), we obtain that the probability of $t_j$ not being a good time is upper bounded by

\[ \frac{(1 - \lambda^2)(t^+ - t^-) + 11(1 - \lambda)R_2/(10\vartheta)}{8(1 - \lambda)R_2/\vartheta} \leq \vartheta(1 - \lambda^2) + \frac{11}{80} < \frac{9\vartheta}{25\vartheta} + \frac{7}{50} = \frac{1}{2}. \]

Here we use that the time when $R_3$ expires is independent of $\mathcal{E}'|_{[0,t_{j-1}]}$.
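The probability bound above can be evaluated numerically. The following sketch is an illustration under stated assumptions: the function name is hypothetical, and the sample value $\lambda = 4/5$ for $\vartheta = 1$ is an assumption chosen so that $\vartheta(1 - \lambda^2) = 9/25$; the definition of $\lambda$ is not restated in this section.

```python
from fractions import Fraction


def not_good_probability_bound(interval_len, R2, lam, theta):
    """Upper bound on the probability that t_j is not good: volume of times that
    are not good (via Lemma 4.8) divided by 8*(1 - lam)*R2/theta."""
    bad_volume = (1 - lam**2) * interval_len + 11 * (1 - lam) * R2 / (10 * theta)
    return bad_volume / (8 * (1 - lam) * R2 / theta)


if __name__ == "__main__":
    lam, theta, R2 = Fraction(4, 5), Fraction(1), 100  # illustrative values
    widest = 8 * (1 - lam) * R2                        # largest admissible t+ - t-
    print(not_good_probability_bound(widest, R2, lam, theta))  # 199/400
```

With these values the bound evaluates to $9/25 + 11/80 = 199/400$, just below 1/2, which is what the induction step requires.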

We complete our reasoning as follows. Given $\mathcal{E}|_{[0,t_{j-1}]}$, we permit an adversary to choose $\mathcal{E}'$, including random bits of all nodes and full knowledge of the future, with the exception that we deny it control or knowledge of the time $t_j$ when $R_3$ expires at node $i$, i.e., $\mathcal{E}'$ is an imaginary execution in which $R_3$ does not expire at $i$ at any time greater than $t_{j-1}$. Note that for the good $W$-resynchronization points we considered, the choice of $\mathcal{E}'$ does not affect the probability that $t_1, \ldots, t_{j-1}$ are good $W$-resynchronization points: The conditions referring to times greater than a $W$-resynchronization point $t$, i.e., that all nodes in $W$ switch to state supp $\rightarrow$ resync during $(t, t + 2d)$ and no node in $W$ shall be in state join during $(t - T_1 - d, t + 4d)$, are already fully determined by the history of the system until time $t$. As we fixed $\mathcal{E}'$, the behaviour of the clock driving $R_3$ is fixed as well. Next, we determine the time $t_j$ when $R_3$ expires according to its distribution, given the behaviour of node $i$'s clock. The above reasoning shows that time $t_j$ is good in $\mathcal{E}'$ with probability at least 1/2, independently of $\mathcal{E}'|_{[0,t_{j-1}]} = \mathcal{E}|_{[0,t_{j-1}]}$. We define that $\mathcal{E}|_{[0,t_j)} = \mathcal{E}'|_{[0,t_j)}$ and that at time $t_j$ node $i$ switches to state init (because $R_3$ expired). As, conditional to the clock driving $R_3$ and $t_{j-1}$ being specified, $t_j$ is independent of $\mathcal{E}'|_{[0,t_j)}$, $\mathcal{E}$ is indistinguishable from $\mathcal{E}'$ until time $t_j$. Because $t_j$ is good in $\mathcal{E}'$ with probability at least 1/2 independently of $\mathcal{E}'|_{[0,t_{j-1}]} = \mathcal{E}|_{[0,t_{j-1}]}$, so it is in $\mathcal{E}$. Hence, in $\mathcal{E}$, $t_j$ is a good $W$-resynchronization point with probability at least 1/2, independently of $\mathcal{E}|_{[0,t_{j-1}]}$. Since $\mathcal{E}'$ was chosen in an adversarial manner, this completes the induction step.

In summary, we showed that for any node in $W$ and any execution (in which we do not manipulate the times when $R_3$ expires at the respective node), starting from the second time during $[t, t + (k+1)R_3]$ when $R_3$ expires at the respective node, there is a probability of at least 1/2 that the respective time is a good $W$-resynchronization point. Since we assumed that $|W| = n - f$ and there are at least $k$ such times for each node in $W$, this implies that having no good $W$-resynchronization point during $[t, t + (k+1)R_3]$ is at least as unlikely as $k(n-f)$ unbiased and independent coin flips all showing tail, i.e., $(1/2)^{k(n-f)}$. This concludes the proof. □
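To get a feeling for how quickly the failure probability $(1/2)^{k(n-f)}$ from the coin-flip argument decays, the following sketch evaluates it exactly. The function name and the sample parameters are illustrative assumptions, not values taken from the paper.

```python
from fractions import Fraction


def no_good_point_probability(k: int, n: int, f: int) -> Fraction:
    """Bound on the probability of seeing no good resynchronization point.

    Mirrors the coin-flip argument: k * (n - f) independent fair coins
    all showing tail.
    """
    if k < 0 or f < 0 or n <= f:
        raise ValueError("need k >= 0, f >= 0 and n > f")
    return Fraction(1, 2 ** (k * (n - f)))


# Illustrative parameters: n = 4 nodes, f = 1 fault, k = 3 attempts per node.
p_fail = no_good_point_probability(3, 4, 1)
```

Because the exponent is a product, the bound improves exponentially in both the number of attempts $k$ and the number of non-faulty nodes $n - f$.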

Having established that eventually a good $W$-resynchronization point $t_g$ will occur, we turn to proving the convergence of the main routine. We start with a few helper statements wrapping up that a good resynchronization point guarantees proper reset of flags and timeouts involved in the stabilization process of the main routine.

Lemma 4.10. Suppose $t_g$ is a good $W$-resynchronization point. Then

(i) each node $i \in W$ switches to passive at a time $t_i \in (t_g + 4d, t_g + (4\vartheta+3)d)$ and observes itself in state dormant during $[t_g + 4d, \tau_{i,i}(t_i))$,

(ii) $\mathrm{Mem}_{i,j,\mathrm{join}}|_{[\tau_{i,i}(t_i), t_{join}]} \equiv 0$ for all $i,j \in W$, where $t_{join} \ge t_g + 4d$ is the infimum of all times greater than $t_g - T_1 - d$ when a node from $W$ switches to join,

(iii) $\mathrm{Mem}_{i,j,\mathrm{sleep}\to\mathrm{waking}}|_{[\tau_{i,i}(t_i), t_s]} \equiv 0$ for all $i,j \in W$, where $t_s \ge t_g + (1+1/\vartheta)T_1$ is the infimum of all times greater or equal to $t_g$ when a node from $W$ switches to sleep → waking,

(iv) no node from $W$ resets its sleep → waking flags during $[t_g + (1+1/\vartheta)T_1, t_g + R_1/\vartheta]$, and

(v) no node from $W$ resets its join flags due to switching to passive during $[t_g + (1+1/\vartheta)T_1, t_g + R_1/\vartheta]$.

Proof. All nodes in $W$ switch to state supp → resync during $(t_g, t_g + 2d)$ and switch to state resync when their timeout of $\vartheta 4d$ expires, which does not happen until time $t_g + 4d$. Once this timeout expired, they switch to state passive as soon as they observe themselves in state resync, i.e., by time $t_g + (4\vartheta+3)d$. Hence, every node $i \in W$ does not observe itself in state resync within $[t_g + 3d, \tau_{i,i}(t_i))$, and therefore is in state dormant during $[t_g + 3d, \tau_{i,i}(t_i)]$. This implies that it observes itself in state dormant during $[t_g + 4d, \tau_{i,i}(t_i))$, completing the proof of Statement (i).
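The step "a timeout of $\vartheta 4d$ does not expire before $4d$ real time has passed" uses only the drift model: a local clock runs at a rate between 1 and $\vartheta$. A minimal sketch of that conversion follows; the function name `expiry_window` and the numeric values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def expiry_window(duration: Fraction, theta: Fraction) -> tuple:
    """Real-time window in which a timeout of local duration `duration` expires.

    Assumes the driving clock runs at a rate in [1, theta]: the fastest clock
    needs duration / theta real time, the slowest needs the full duration.
    """
    if theta < 1:
        raise ValueError("drift bound theta must be at least 1")
    return (duration / theta, duration)


theta = Fraction(6, 5)   # illustrative drift bound
d = Fraction(1)          # illustrative delay bound
earliest, latest = expiry_window(theta * 4 * d, theta)
```

With these sample values the timeout expires no earlier than $4d$ and no later than $4\vartheta d$ after it was started, matching the bounds used in the proof.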

Moreover, from the definition of a good $W$-resynchronization point we have that no nodes from $W$ are in state join at times in $[t_g - T_1 - d, t_{join})$. Statement (ii) follows, as every node $i$ from $W$ resets its join flags upon switching to state passive at time $t_i$.

Regarding Statement (iii), observe first that no nodes from $W$ are in state sleep → waking during $(t_g - d, t_g + (1+1/\vartheta)T_1)$ for the following reason: By definition of a good $W$-resynchronization point no node from $W$ switches to sleep during $(t_g - \Delta_g, t_g) \supseteq (t_g - (\vartheta+1)T_1 - 3d, t_g)$. Any node in $W$ that is in states sleep or sleep → waking at time $t_g - (\vartheta+1)T_1 - 3d$ switches to state waking before time $t_g - d$ due to timeouts. Finally, any node in $W$ switching to sleep at or after time $t_g$ will not switch to state sleep → waking before time $t_g + (1+1/\vartheta)T_1$. The observation follows.

Since nodes in $W$ reset their sleep → waking flags at some time from

$$[t_i, \tau_{i,i}(t_i)] \subseteq (t_g + 3d, t_g + (4\vartheta+4)d) \subseteq (t_g + 3d, t_g + (1+1/\vartheta)T_1),$$

Statement (iii) follows.

Statements (iv) and (v) follow from the fact that all nodes in $W$ switch to state passive until time $t_g + (4\vartheta+3)d$ (cf. Statement (i)), while timeout $(R_1, \text{supp} \to \text{resync})$ must expire first in order to switch to dormant and subsequently passive again. □
Before we proceed, in the next lemma we make the basic yet crucial observation that after a good $W$-resynchronization point $t_g$, no node from $W$ will switch to state join until either time $t_g + T_7/\vartheta + 4d$ or $T_6/\vartheta$ time after the first non-faulty node switched to sleep → waking again after $t_g$. By proper choice of $T_6$ and $T_7 > T_6$, this will guarantee that nodes from $W$ do not switch to join prematurely during the final steps of the stabilization process.

Lemma 4.11. Suppose $t_g$ is a good $W$-resynchronization point. Denote by $t_s$ the infimum of times
greater than $t_g$ when a node in $W$ switches to state sleep → waking and by $t_{join}$ the infimum of
times greater than tg — Ti — d when a node in W switches to state join. Define := tg + — 
+ (5s + 72 + T4 + T5 + d. Then, starting from time tg + Ad, (recover, join) is not satisfied at 
any node in W until time 



$$\min\{t_s + T_6/\vartheta,\ t_g + T_7/\vartheta + 4d\} \ge \min\{t_s + \Delta_s,\ t^+\}$$

and $t_{join}$ is larger than this time.



Proof. By Statements (ii) and (iii) of Lemma 4.10 and Inequality (2), we have that $t_s \ge t_g + T_1 + 4d \ge t_g + (4\vartheta+4)d$ and $t_{join} \ge t_g + 4d$. Consider a node $i \in W$ not observing itself in state dormant at some time $t \in [t_g + 4d, t_{join}]$. According to Statements (i) and (ii) of Lemma 4.10, the threshold condition of $f+1$ nodes memorized in state join cannot be satisfied at such a node. By Statements (i) and (iii) of the lemma, the threshold condition of $f+1$ nodes memorized in state sleep → waking cannot be satisfied unless $t > t_s$. Hence, if at time $t$ a node from $W$ satisfies that it observes itself in state active and $T_6$ expired, we have that $t > t_s + T_6/\vartheta$. Moreover, by Statement (i) of Lemma 4.10, we have that if $T_7$ is expired at any node in $W$ at time $t$, it holds that $t > t_g + T_7/\vartheta + 4d$. Altogether, we conclude that tr(recover, join) is not satisfied at any node in $W$ during

$$[t_g + 4d, \min\{t_s + T_6/\vartheta,\ t_g + T_7/\vartheta + 4d\}] \supseteq [t_g + 4d, \min\{t_s + \Delta_s,\ t^+\}].$$

In particular, $t_{join}$ must be larger than the upper boundary of this interval, concluding the proof. □

Before we can move on to proving eventual stabilization, we need one last key lemma. Essentially, it states that after a good $W$-resynchronization point, any node in $W$ switches to recover or to sleep → waking within bounded time, and all nodes in $W$ doing the latter will do so in rough synchrony, i.e., within a time window of $\delta_s$. Using the previous lemma, we can show that this happens before the transition to join is enabled for any node.



Lemma 4.12. Suppose $t_g$ is a good $W$-resynchronization point and use the notation of Lemma 4.11. Then either

(i) $t_s < t^+ - \Delta_s$ and any node in $W$ switches to state sleep → waking at some time in $[t_s, t_s + \delta_s]$ or is observed in state recover during $[t_s + T_1 + T_5, t_{join}]$, or

(ii) all nodes in $W$ are observed in state recover during $[t^+, t_{join}]$.



Proof. By Lemma 4.11, it holds that

$$t_{join} \ge \min\{t_s + \Delta_s,\ t^+\}. \qquad (26)$$

For any node in $W$, consider the supremum $t$ of all times smaller or equal to $t_g - \Delta_g$ when it switched to sleep. After that, it observed itself in state waking before time

$$t + (\vartheta+1)T_1 + 3d \le t_g - T_1 - d \qquad (27)$$

(w.l.o.g. assuming that the node has ever been in state sleep since it became non-faulty). By definition of a good $W$-resynchronization point, nodes in $W$ are not in state join during $(t_g - T_1 - d, t_{join})$ and do not switch to state sleep during $(t_g - \Delta_g, t_g)$. Continuing to execute the basic cycle after time $t_g - d > t_g - T_1 - d$ thus necessitates that the node is in one of the states waking, ready, propose or accept at time $t_g - d$.

Assume that it is in state waking (we just showed that if not, it already was in waking by time 
tg — d). As timeout T2 cannot have been reset later than time t — Ti/'d + d< tg — Ag — Ti/'d + d at 
the respective node, it observes itself in state ready by time tg — Ag — Ti/'d + T2 + 2d, in state propose 
by time tg — Ag — Ti/{^ + T2 + Ti + 3d, in state accept by time tg — Ag — Ti / + T2 + + + Ad, 
in state sleep by time — + (1 — 1/ '&)Ti + + + + 5d, and must switch to sleep — )• waking 
before time t"*" — A^. 

We next distinguish between two cases: 

Case 1: Assume that tg < t~^ — Ag. We already established that no node in W observes itself 
in states sleep or sleep — t- waking at time tg, and by Inequality (27), any node in W observing 
itself in states waking or ready reset its accept flags after time tg — Ti — d. Denote by t'g £ 
{tg — {"& + l)Ti — d,ts — {1 + l/'!?)Ti) the minimal time greater or equal to tg when a node from W 
switches to state sleep; by the timeout condition for switching from sleep to sleep — )• waking and the 
definitions of tg and good VF-resynchronization points, such a time exists. According to Lemma 4.5[ 
at least / + 1 nodes have been in state accept at times in {t'g — Ti — d,t'g). By Statements (i) 
and (iii) of Lemma 4.10, all nodes are in state passive until at least time tg. Hence, any nodes 



from W observing themselves in state waking or ready at time t'g + d satisfy tr {waking, recover) or 
tr{unsuspect, suspect), respectively. Consequently, they will leave these states no later than time 

t'g + {2d + 2)d <tg- (^l + ^^Ti + {2i) + 2)d ft, - Ad. 

It follows that any nodes from W that are in state propose at time tg observe themselves in 
this state since at least time tg — 3d, implying that they switch to states accept or recover by time 
tg + — 3d. After switching to accept, a node from W switches to sleep and subsequently to 
sleep — >• waking within another (2?? + l)Ti + 2d or is observed in state recover after less than Ti + 2d 
time. Thus, as 



tjo^n >tg + Ag-{'&- l/'&)Ti -d)y tg+Ti+n-d, 



all nodes in W that do not switch to state sleep — )• waking during 



[tg,tg + {2d + i)Ti + n-d] 



Ti-d 



C 



tg,t'g + Ag+{l + -]Ti 



are observed in state recover at time + Ti + T5 . Because tjoin > tg + Ag — {d — 1 / i9)Ti — d and no 



nodes from W switch to state sleep during {tg — Ag,tg), we can apply Statement (ii) of Corollary 4.6 
to conclude that no nodes from W switch to state sleep — )• waking during 



tg + 6g,t', + Ag+ 1 + 



1 



Ti 
i.e., any node from $W$ that does not switch to state sleep → waking during $[t_s, t_s + \delta_s]$ is observed in state recover during $[t_s + T_1 + T_5, t_{join}]$. Statement (i) follows.

Case 2: Assume tg > — Ag. Then by Inequahty (26) 
the first node in W switching to sleep 



waking after tg does so at time t, 
given above, no node from W executing the basic cycle does so later than t 



holds. By definition of tg, 
and by the arguments 



A, < t"*" — d. Hence, 



it is observed in state recover during [t~^ , tjoin] , as it cannot leave recover through join before time 



Hence Statement (ii) holds and the proof concludes. 



□ 



We have everything in place for proving that a good resynchronization point leads to stabilization within $R_1/\vartheta - 3d$ time.

Theorem 4.13. Suppose $t_g$ is a good $W$-resynchronization point. Then there is a quasi-stabilization point during $(t_g, t_g + R_1/\vartheta - 3d]$.



Proof. For simplicity, assume during this proof that $R_1 = \infty$, i.e., by Statement (i) of Lemma 4.10 all nodes in $W$ observe themselves in states passive or active at times greater or equal to $t_g + (4\vartheta+4)d$. We will establish the existence of a quasi-stabilization point at a time larger than $t_g$ and show that it is upper bounded by $t_g + R_1/\vartheta - 3d$. Hence this assumption can be made w.l.o.g., as the existence of the quasi-stabilization point depends on the execution up to time $t_g + R_1/\vartheta$ only, and $R_1$ cannot expire before this time at any node in $W$. We use the notation of Lemma 4.11. By Statements (ii) and (iii) of Lemma 4.10 and Inequality (2) we have that $t_s \ge t_g + T_1 + 4d \ge t_g + (4\vartheta+4)d$. By Lemma 4.11 it holds that $t_{join} \ge \min\{t_s + \Delta_s, t^+\}$. We differentiate several cases.



Case 1: Assume $t_s \ge t^+ - \Delta_s$. According to Lemma 4.10, all nodes in $W$ switched to state passive during $(t_g + 4d, t_g + (3+4\vartheta)d)$, implying that at any node in $W$, $T_7$ will expire at some time from $(t_g + T_7/\vartheta + 4d, t_g + T_7 + (4\vartheta+4)d)$. By Lemma 4.12 we have that all non-faulty nodes are observed in state recover during $[t^+, t_{join}]$. By Statement (v) of Lemma 4.10, no node in $W$ resets its join flags after time $t^+$ before it switches to state propose, returning to the basic cycle. Thus, any node from $W$ will switch to state join before time $t_g + T_7 + (4\vartheta+4)d$ and switch to propose as soon as it memorizes all non-faulty nodes in state join. Denote by $t_p \in (t_g + T_7/\vartheta + 4d, t_g + T_7 + (4\vartheta+5)d)$ the minimal time when a node from $W$ switches from join to propose. Certainly, nodes in $W$ do not switch from waking to ready during $(t_p, t_p + 2d)$ and therefore also do not reset their join flags before time $t_p + 3d$. As nodes in $W$ reset their propose and accept flags upon switching to state join, some node in $W$ must memorize $n - 2f \ge f+1$ non-faulty nodes in state join at time $t_p$. According to Statement (ii) of Lemma 4.10 these nodes must have switched to state join at or after time $t_{join}$. Hence, all nodes in $W$ will memorize them in state join by time $t_p + d$ and thus have switched to state join. Hence, all nodes in $W$ will switch to state propose before time $t_p + 2d$ and subsequently to state accept before time $t_p + 3d$, i.e., $t_p < t_g + T_7 + (4\vartheta+5)d$ is a quasi-stabilization point.
Case 2a: Assume tg < — A^ and < / + 1 nodes in W swi tch to sleep — )• waking during 

[ts,tg -\- 5g]. We then have that ^ tg + Ti + T5. According to Lemma 4.12 
that does not switch to state sleep — )• waking is observed in state recover during [tg 
Thus, any node in W will observe at least n — 2f > f + 1 nodes from W in state recover during 
[tg + Ti -\-T^, tjoin\ ■ As nodes in W reset their propose flags when switching to state ready and 



any node in W 

~l~ -\- T5 , tjoin] • 



tg + Ti + n^ tg 



+ 



T2 + T3 



{'d + 2)Ti - {2-0 + A)d, 
a node from $W$ switching to state sleep → waking at or after time $t_s$ cannot switch to propose via states waking and ready before time $t_s + T_1 + T_5 + (2\vartheta+1)d$. Any node in $W$ switching to state sleep → waking during $[t_s, t_s + \delta_s]$ will observe itself in state waking before time

$$t_s + \delta_s + 2d \le t_s + \Delta_s - (2\vartheta+2)d < t^+ - (2\vartheta+2)d.$$



By Lemma 4.11, tr(recover, join) cannot be satisfied at any node in $W$ until time $\min\{t_s + \Delta_s, t^+\}$. Thus, we have that no node from $W$ switches from ready to join during $[t_s, t_{join})$ by definition of $t_{join}$ and any node in $W$ that observes itself in states ready and suspect will switch to state recover once $(2\vartheta d, \text{suspect})$ expires. In summary, any node in $W$ switching to state sleep → waking at some time in $[t_s, t_s + \delta_s]$ will switch from waking to recover or from unsuspect to suspect by time $t_s + \Delta_s - (2\vartheta+2)d$, and in the latter case it cannot leave state ready before switching to state recover due to tr(ready, recover) being satisfied. As the latter happens before time $t_s + \Delta_s - d \le t_{join} - d$, all nodes in $W$ are observed in state recover during $[t_s + \Delta_s, t_{join}]$. From here we can argue analogously to the first case, i.e., there exists a quasi-stabilization point $t_p < t_g + T_7 + (4\vartheta+5)d$.

Case 2b: Assume $t_s < t^+ - \Delta_s$ and at least $f+1$ nodes in $W$ switch to sleep → waking during $[t_s, t_s + \delta_s]$. By Statements (ii) and (iv) of Lemma 4.10 no node from $W$ resets its sleep → waking flags at or after time $t_s \ge t_g + (1+1/\vartheta)T_1$. Hence, by Statement (i) of the lemma, all nodes in $W$ switch to active during $(t_s, t_s + \delta_s + d)$. Between $T_6/\vartheta$ and $T_6 + d$ time later $T_6$ will expire. We have that

$$t_s + \frac{T_6}{\vartheta} < t^+ - \Delta_s + \frac{T_6}{\vartheta} \le t_g + \frac{T_7}{\vartheta} + 4d.$$

Thus, according to Lemma 4.11, $t_{join} \ge t_s + T_6/\vartheta$. On the other hand, at the latest once $T_6$ expires, tr(recover, join) holds at every node.
By time 



tg + '^ytg + dg-(^l-^jTi + T2 + 2d, 
the nodes in W that switched to state sleep — )• waking observe themselves in state ready because 



of timeouts or are in state recover. By Statement (v) of Lemma 4.10, after this time no node in W 
resets its join flags again before it runs through the basic cycle again and switches to state ready. 
Hence, all nodes in W will switch to states join or propose until time 

max ltg + 6g-(l + ^]Ti+T2 + T4 + 2d,tg + 6g + T6 + 3d\+d 



^ tg+[d + l-ljT,+T, + T, + 7d, 

where we accounted for an additional delay of $d$ due to a possible transition from ready to recover just before time $t_s + \delta_s + T_6 + 2d$ and, if no node from $W$ switches from ready to join, all nodes in $W$ needing to be observed in state join for a node in $W$ to switch to state propose. It follows that a minimal time $t_p \in (t_s + T_6/\vartheta, t_s + (\vartheta + 1 - 2/\vartheta)T_1 + T_2 + T_4 + 7d)$ exists when a node from $W$ switches to state propose. Again, we distinguish two cases.

Case 2b-I: Assume that some node in $W$ switches from state join to state propose at time $t_p$. Thus, there must be at least $n - 2f \ge f+1$ non-faulty nodes in state join at time $t_p - \varepsilon$ (for some arbitrarily small $\varepsilon > 0$), as any propose or accept flag corresponding to a non-faulty node has been reset at a time $t$ satisfying that the respective node has not been observed in one of these states during $[t, t_p]$. Thus, all nodes in $W$ will switch to states join or propose before time $t_p + d$. At time $t_p + 2d$, they will observe all non-faulty nodes in one of the states join, propose, or accept, i.e., they switch to state propose before time $t_p + 2d$. Finally, they will observe all non-faulty nodes in states propose or accept before time $t_p + 3d < t_p + T_1/\vartheta$ and switch to state accept. As $t_p$ is minimal, we conclude that all nodes in $W$ switched to state accept during $(t_p, t_p + 3d)$, i.e., $t_p$ is a quasi-stabilization point.
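The threshold step $n - 2f \ge f + 1$ used here is the usual counting argument for Byzantine resilience. The sketch below checks it under the assumption $n \ge 3f + 1$ (the standard optimal-resilience bound, which this section does not restate); the function name is hypothetical.

```python
def threshold_reached(n: int, f: int) -> bool:
    """True if n - 2f non-faulty observations already exceed the f + 1 threshold.

    n - f nodes are non-faulty; at most f of the flags a node holds may stem
    from faulty nodes, leaving at least n - 2f genuine ones.
    """
    return n - 2 * f >= f + 1


# Assumption: n >= 3f + 1. The bound is tight: it fails for n = 3f.
holds_at_bound = all(threshold_reached(3 * f + 1, f) for f in range(50))
fails_below_bound = not any(threshold_reached(3 * f, f) for f in range(50))
```

This illustrates why a single observed transition backed by $n - 2f$ flags forces every non-faulty node past the $f+1$ threshold within one further message delay.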

Case 2b-II: Otherwise, some node in $W$ switched from state ready to state propose at time $t_p$.
As we have that



ts + 5s + n + U^ts-{^ + l)Ti + - d, 

$T_6$ is expired at all nodes in $W$ since time $t_p - 2d$, i.e., tr(recover, join) is satisfied at all nodes in $W$ since time $t_p - 2d$. Hence, all nodes in $W$ are observed in states ready or join at time $t_p$, and no node from $W$ may switch to state recover again or reset its propose flags before switching to resync or accept first after time $t_p$.

Denote by $t_a$ the infimum of times greater than $t_p$ when a node from $W$ switches to accept and assume for the moment that no node from $W$ may switch from propose to recover before switching to accept first after time $t_p$. As nodes in $W$ reset their propose flags upon switching to states ready or join, there must be $n - 2f \ge f+1$ non-faulty nodes that switched to state propose during $[t_p, t_a)$ (unless $t_a = \infty$, which will be ruled out shortly). Thus, all nodes in $W$ leave state ready before time $t_a + d$, and are observed in states propose or join before time $t_a + 2d$. Recalling that all nodes in $W$ switch to states join or propose until time

$$t_s + (\vartheta + 1 - 2/\vartheta)T_1 + T_2 + T_4 + 7d,$$

we get that indeed all nodes in $W$ are observed in one of these states after time $t_p$ and before time

$$\min\{t_a + 2d,\ t_s + (\vartheta + 1 - 2/\vartheta)T_1 + T_2 + T_4 + 8d\}.$$

Thus, at any node from $W$, tr(join, propose) will be satisfied before this time, and it will be observed in state propose less than $d$ time later. It follows that all nodes in $W$ switch to state accept before time

$$t_q + 3d := \min\{t_a + 3d,\ t_s + (\vartheta + 1 - 2/\vartheta)T_1 + T_2 + T_4 + 9d\},$$

i.e., $t_q$ is a quasi-stabilization point. As we made the assumption that no node from $W$ switches from propose to recover before switching to accept, we need to show that $T_5$ does not expire at any node from $W$ in state propose until time $t_q + 3d$. This holds true because



+ ^ > is H ts + \V + \ — — ^ ±1 ^ ±2 ^ ±4:^ ^ l-q 

It remains to check that in all cases, the obtained quasi-stabilization point $t_q$ occurs no later than time $t_g + R_1/\vartheta - 3d$. In Cases 1 and 2a, we have that



, _tg + ^-3d. 
In Case 2b, it holds that



tg + 



A, + M + 1 
Ri 



T1 + T2 + T4 + 9d 



T1 + T2 + T4 + 9d 



i9 



3d. 



We conclude that indeed all nodes in $W$ switch to accept within a window of less than $3d$ time before, at any node in $W$, $R_1$ expires and it leaves state resync, concluding the proof. □



Finally, putting together our main theorems and Lemma 3.4, we deduce that the system will stabilize from an arbitrary initial state provided that a subset of $n - f$ nodes remains coherent for a sufficiently large period of time.



Corollary 4.14. Suppose that $\vartheta < \vartheta_{\max} \approx 1.247$ as given in Lemma 3.4. Let $W \subseteq V$, where $|W| \ge n - f$, and define for any $k \in \mathbb{N}$

$$T(k) := (k+2)(\vartheta(R_2 + 3d) + 8(1-\lambda)R_2 + d) + R_1/\vartheta.$$

Then, for any $k \in \mathbb{N}$, the proposed algorithm is a $(W, W^2)$-stabilizing pulse synchronization protocol with skew $2d$ and accuracy bounds $(T_2 + T_3)/\vartheta - 2d$ and $T_2 + T_4 + 7d$ stabilizing within time $T(k)$ with probability at least $1 - 1/2^{k(n-f)}$. It is feasible to pick timeouts such that $T(k) \in O(kn)$ and $T_2 + T_4 + 7d \in O(1)$.



Proof. The satisfiability of Condition 3.3 with $T(k) \in O(kn)$ and $T_2 + T_4 + 7d \in O(1)$ follows from Lemma 3.4. Assume that $t^+$ is sufficiently large for $[t^- + T(k) + 2d, t^+]$ to be non-empty, as otherwise nothing is to show. By definition, $W$ will be coherent during $[t_c^-, t^+]$, with $t_c^- = t^- + \vartheta(R_2 + 3d) + 8(1-\lambda)R_2 + d$. According to Theorem 4.9, there will be some good $W$-resynchronization point $t_g \in [t_c^-, t_c^- + (k+1)(\vartheta(R_2 + 3d) + 8(1-\lambda)R_2 + d)]$ with probability at least $1 - 1/2^{k(n-f)}$. If this is the case, Theorem 4.13 shows that there is a $W$-stabilization point $t \in [t_g, t^- + T(k)]$. Applying Theorem 4.4 inductively, we derive that the algorithm is a $(W, E)$-stabilizing pulse synchronization protocol with the bounds as stated in the corollary that stabilizes within time $T(k)$ with probability at least $1 - 1/2^{k(n-f)}$. □
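To see how the stabilization time $T(k)$ trades off against the success probability, the following sketch evaluates both expressions from Corollary 4.14 with exact fractions. All numeric values are made up for illustration and are not claimed to satisfy Condition 3.3; the function names are hypothetical.

```python
from fractions import Fraction


def stabilization_time(k, theta, lam, R1, R2, d):
    """T(k) = (k + 2) * (theta*(R2 + 3d) + 8*(1 - lam)*R2 + d) + R1 / theta."""
    per_attempt = theta * (R2 + 3 * d) + 8 * (1 - lam) * R2 + d
    return (k + 2) * per_attempt + R1 / theta


def success_probability(k, n, f):
    """Lower bound 1 - 1 / 2^(k(n - f)) on stabilizing within T(k)."""
    return 1 - Fraction(1, 2 ** (k * (n - f)))


# Illustrative (assumed) parameters only.
theta, lam = Fraction(6, 5), Fraction(9, 10)
R1, R2, d = Fraction(120), Fraction(10), Fraction(1)

T1 = stabilization_time(1, theta, lam, R1, R2, d)
p1 = success_probability(1, 4, 1)
```

The time grows linearly in $k$ while the failure probability shrinks exponentially in $k(n-f)$, which is why a small constant $k$ already gives a high-probability guarantee for moderate $n$.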



5 Generalizations and Extensions 

5.1 Synchronization Despite Faulty Channels 

Theorem 4.13 and our notion of coherency require that all involved nodes are connected by correct channels only. However, it is desirable that non-faulty nodes synchronize even if they are not connected by correct channels. To capture this, the notions of coherency and stability can be generalized as follows.

Definition 5.1 (Weak Coherency). We call the set $C \subseteq V$ weakly coherent during $[t^-, t^+]$ iff for any node $i \in C$ there is a subset $C' \subseteq C$ that contains $i$, has size $n - f$, and is coherent during $[t^-, t^+]$.

In particular, if there are in total at most $f$ nodes that are faulty or have faulty outgoing channels, then the set of non-faulty nodes is (after some amount of time) weakly coherent.

Corollary 5.2. For each $k \in \mathbb{N}$ let $T'(k) := T(k) - (\vartheta(R_2 + 3d) + 8(1-\lambda)R_2 + d)$, where $T(k)$ is defined as in Corollary 4.14. Suppose the subset of nodes $C \subseteq V$ is weakly coherent during the time interval $[t^-, t^+] \supseteq [t^- + T'(k) + T_2 + T_4 + 8d, t^+] \ne \emptyset$. Then, with probability at least $1 - (f+1)/2^{k(n-f)}$, there is a $C$-quasi-stabilization point $t \le t^- + T'(k) + T_2 + T_4 + 5d$ such that the system is weakly $C$-coherent during $[t, t^+]$.

Proof. By the definition of weak coherency, every node in $C$ is in some coherent set $C' \subseteq C$ of size $n - f$. Hence, for any such $C'$ it holds that we can cover all nodes in $C$ by at most $1 + |V \setminus C'| \le f + 1$ coherent sets $C_1, \ldots, C_{f+1} \subseteq C$. By Corollary 4.14 and the union bound, with probability at least $1 - (f+1)/2^{k(n-f)}$, for each of these sets there will be at least one stabilization point during $[t^-, t^- + T'(k) - (T_2 + T_4 + 5d)]$. Assuming that this is indeed true, denote by $t_{i_0} \in [t^-, t^- + T'(k) - (T_2 + T_4 + 5d)]$ the time

$$\max_{i \in \{1,\ldots,f+1\}} \{\max\{t \le t^- + T'(k) - (T_2 + T_4 + 5d) \mid t \text{ is a } C_i\text{-stabilization point}\}\},$$

where $i_0 \in \{1, \ldots, f+1\}$ is an index for which the first maximum is attained and $t_{i_0}$ is the respective maximal time, i.e., $t_{i_0}$ is a $C_{i_0}$-stabilization point.

Define $t'_{i_0} \in (t_{i_0}, t^- + T'(k)]$ to be minimal such that it is another $C_{i_0}$-stabilization point. Such a time must exist by Theorem 4.4. Since the theorem also states that no node from $C_{i_0}$ switches to state accept during $[t_{i_0} + 2d, t'_{i_0})$ and $C_i \cap C_{i_0} \ne \emptyset$, there can be no $C_i$-stabilization point during $(t_{i_0} + 2d, t'_{i_0} - 2d)$ for any $i \in \{1, \ldots, f+1\}$. Applying the theorem once more, we see that there are also no $C_i$-stabilization points during $(t'_{i_0} + 2d, t'_{i_0} + (T_2 + T_3)/\vartheta - 2d)$ for any $i \in \{1, \ldots, f+1\}$. On the other hand, the maximality of $t_{i_0}$ implies that every $C_i$ had a stabilization point by time $t_{i_0}$. Applying Theorem 4.4 to the latest stabilization point until time $t_{i_0}$ for each $C_i$, we see that it must have another stabilization point before time $t_{i_0} + T_2 + T_4 + 5d$. We have that

2(T2 + Ts) , P} T2 + T3 + T5 
^ Z ^ -2dy : y T2 + Ti + 5d, 

i.e., all $C_i$ have stabilization points within a short time interval of $(t'_{i_0} - 2d, t'_{i_0} + 2d)$. Arguing analogously about the previous stabilization points of the sets $C_i$ (which exist because $t_{i_0}$ is maximal), we infer that all $C_i$ had their previous stabilization point during $(t_{i_0} - 2d, t_{i_0} + 2d)$.

Now suppose $t_a$ is the minimal time in $(t'_{i_0} - 2d, t'_{i_0} + 2d)$ when a node from $C$ switches to accept and this node is in set $C_i$ for some $i \in \{1, \ldots, f+1\}$. As usual, there must be at least $f+1$ non-faulty nodes from $C_i$ in state propose at time $t_a$ and by time $t_a + d$ all nodes from $C_i$ will be in either of the states propose or accept. As $|C_i \cap C_j| \ge f+1$ for any $j \in \{1, \ldots, f+1\}$, all nodes in $C_j$ will observe $f+1$ nodes in state propose at times in $(t_a, t_a + 2d)$. We have that $t_a \ge t_{i_0} + (T_2 + T_3)/\vartheta - 2d$ according to Theorem 4.4. As no nodes switched to state accept during $(t_{i_0} + 2d, t_a)$ and none of them switch to state recover (cf. Theorem 4.4), it follows from the Inequality



(T2 + n)/!} - 4d f r2 + 2Ti f r2 + 5d 



that all nodes from $C_j$ observe themselves in one of the states ready or propose at time $t_a$. Hence, they will switch from ready to propose if they still are in ready before time $t_a + 2d$. Less than $d$ time later, all nodes in $C_j$ will memorize $C_j$ in state propose and therefore switch to accept if not done so yet. Since $j$ was arbitrary, it follows that $t_a$ is a $C$-quasi-stabilization point. □
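The union bound in this proof costs a factor of $f+1$ in the failure probability, which can be compensated by a slightly larger $k$. As a rough sketch (function name and target value are illustrative assumptions), the smallest $k$ meeting a target failure probability can be found as follows.

```python
from fractions import Fraction


def min_attempts_for_target(n: int, f: int, eps: Fraction) -> int:
    """Smallest k with (f + 1) / 2^(k(n - f)) <= eps.

    Uses the union bound over the at most f + 1 coherent sets covering C.
    """
    if n <= f or f < 0 or eps <= 0:
        raise ValueError("need n > f >= 0 and eps > 0")
    k = 0
    # The loop terminates because n - f >= 1, so the bound halves at least once per step.
    while Fraction(f + 1, 2 ** (k * (n - f))) > eps:
        k += 1
    return k


# Illustrative: n = 4, f = 1, target failure probability 1/1000.
k_needed = min_attempts_for_target(4, 1, Fraction(1, 1000))
```

Since the extra factor is only $f+1$, the required $k$ grows by at most about $\log_2(f+1)/(n-f)$ compared to the coherent case.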
Corollary 5.3. Suppose $C$ is weakly coherent during $[t^-, t^+]$ and $t \in [t^-, t^+ - (T_2 + T_4 + 8d)]$ is a $C$-quasi-stabilization point. Then

(i) all nodes from $C$ switch to accept exactly once within $[t, t + 3d)$,

(ii) there will be a $C$-quasi-stabilization point $t' \in [t + (T_2 + T_3)/\vartheta, t + T_2 + T_4 + 5d)$ satisfying that no nodes switch to accept in the time interval $[t + 3d, t')$, and

(iii) each node $i$'s, $i \in W$, state of the basic cycle (Figure 1) is metastability-free during $[t + 4d, t' + 4d)$.

Proof. Analogously to the proofs of Theorem 4.4 and Corollary 5.2. □



We point out that one cannot get stronger results by the proposed technique. Even if there are merely $f + 1$ failing channels, this can, e.g., effectively render a node faulty (as it may never see $n - f$ nodes in states propose or accept) or exclude the existence of a coherent set of size $n - f$ (if the channels connect $f + 1$ disjoint pairs of nodes, there can be no subset of $n - f$ nodes whose induced subgraph contains correct channels only). Stronger resilience to channel faults would necessitate propagating information over several hops in a fault-tolerant manner, imposing larger bounds on timeouts and weaker synchronization guarantees.
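The claim about $f + 1$ disjoint faulty channels can be verified by brute force on a small instance. The sketch below is illustrative only: it treats channels as undirected pairs for simplicity (an assumption made here to keep the example short), and the function name is hypothetical.

```python
from itertools import combinations


def has_fault_free_subset(n: int, f: int, faulty_channels) -> bool:
    """True if some n - f nodes are pairwise connected by correct channels only.

    Simplification (assumption): channels are modeled as undirected pairs.
    """
    faulty = {frozenset(c) for c in faulty_channels}
    for subset in combinations(range(n), n - f):
        if not any(frozenset(pair) in faulty for pair in combinations(subset, 2)):
            return True
    return False


# n = 7, f = 2: three disjoint faulty pairs rule out every subset of size 5,
# because avoiding all of them would require dropping at least 3 nodes.
blocked = has_fault_free_subset(7, 2, [(0, 1), (2, 3), (4, 5)])
# With only f = 2 faulty pairs a suitable subset still exists.
still_ok = has_fault_free_subset(7, 2, [(0, 1), (2, 3)])
```

The example shows the pigeonhole reasoning directly: each disjoint faulty pair forces one node out of the candidate set, and only $f$ nodes may be excluded.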



Combination of Corollary 5.2 and Corollary 5.3 finally yields:

Corollary 5.4. Suppose that $\vartheta < \vartheta_{\max} \approx 1.247$ as given in Lemma 3.4. Let $C \subseteq V$ be such that, for each $i \in C$, there is a set $C_i \subseteq C$ with $|C_i| = n - f$, and let $E := \bigcup_{i \in C} C_i^2$. Then the proposed algorithm is a $(C, E)$-stabilizing pulse synchronization protocol with skew $3d$ and accuracy bounds $(T_2 + T_3)/\vartheta - 3d$ and $T_2 + T_4 + 8d$ stabilizing within time $T(k) + T_2 + T_4 + 5d$ with probability at least $1 - (f+1)/2^{k(n-f)}$, for any $k \in \mathbb{N}$.

Proof. Analogously to the proof of Corollary 4.14. □



5.2 Late Joining and Fast Recovery 

An important aspect of combining self-stabilization with Byzantine fault-tolerance is that the sys- 
tem can remain operational when facing a limited number of transient faults. If the affected 
components stabilize quickly enough, this can prevent future faults from causing system failure. 
In an environment where transient faults occur according to a random distribution that is not too 
far from being uniform (i.e., one deals not primarily with bursts), the mean time until failure is 
therefore determined by the time it takes to recover from transient faults. Thus, it is of significant 
interest that a node that starts functioning according to the specifications again synchronizes as 
fast as possible to an existing subset of correct nodes making a quasi-stabilization point. Moreover, 
it is of interest that a node that has been shut down temporarily, e.g. for maintenance, can join the 
operational system again quickly. 

In the presented form, the algorithm suffers from the drawback that a node in state recover may be caught there until the next good resynchronization point. Since Byzantine faults of a certain pattern may deterministically delay this for $O(n)$ time, we would like to modify the algorithm in a way ensuring that a non-faulty node can synchronize to others more quickly if a quasi-stabilization point is reached.

This can be done in a simple manner. Whenever a node switches to state none, it stays until 
a new timeout {Ri,none) expires. When switching to none, it switches also to passive, resets its 
join and sleep → waking flags, and repeats to reset its sleep → waking flags whenever a timeout
of (-i? — + 2)ri + i95d expires. Thus, it will not switch to state active because of outdated 
information. On the other hand, it will not miss the next occurrence of a set $C$ of nodes that are weakly coherent since a $C$-quasi-stabilization point switching to state sleep → waking within a time window of $(1 - 1/\vartheta)(\vartheta + 2)T_1 + \vartheta 5d$, as it will reset its sleep → waking flags at most once in this window, whereas $|C| \ge n - f > 2f$. Subsequently, it will switch to state join at an appropriate time to enter the basic cycle again at the occurrence of the next $C$-stabilization point. Since nodes refrain from leaving state none for a constant period of time only, this way stabilization time in face of severe failures can still be kept linear, while in a stable system, nodes recovering from faults or joining late stabilize in constant time.

Corollary 5.5. The pulse synchronization routine can be modified such that it retains all shown properties, $R_3$ increases by a constant factor, and it holds that, for any node $i$ in $V$, if there is a $C$-quasi-stabilization point at some time $t \le t^-$, so that $C$ is weakly coherent during $[t, t^+]$, and $(C \cup \{i\})$-coherent during $[t^-, t^+]$, then there exists a $(C \cup \{i\})$-quasi-stabilization point at some time $t' \le t^- + O(1)$, so that $(C \cup \{i\})$ is weakly coherent during $[t', t^+]$.

Proof Sketch. Essentially, the fact that $n - f$ nodes continue to execute the basic cycle narrows down the possibilities in the proof of Theorem 4.13 to Case 2b-II, where the threshold for leaving state join will be achieved close to the next $C$-stabilization point due to the involved threshold conditions. Since the nodes in $C$ execute the basic cycle, they are not affected by the re-synchronisation subroutine at all. Thus, $v$ stabilizes independently of this subroutine provided that it resets its join and sleep → waking flags in an appropriate fashion. We explained above how this is done and why a consistent reset of the sleep → waking flags is achieved. The join flags are not an issue since at most $n - |C| \le f$ channels can attain state join. As a node switches to state none again in constant time whenever it leaves, the node will stabilize in constant time provided that there is a $C$-quasi-stabilization point from where on $C$ is weakly coherent until time $t^+$. On the other hand, we can easily adapt the re-synchronisation subroutine, Lemma 4.8, Theorem 4.9, and Condition 3.3 to allow for the additional time nodes are non-responsive with respect to the re-synchronisation subroutine, increasing by a constant factor only. □



5.3 Stronger Adversary 

So far, our analysis considered a fixed set C of coherent (or weakly coherent) nodes. But what
happens if it is not determined upfront whether a node becomes faulty, and this instead depends on
the execution? Phrased differently, does the algorithm still stabilize quickly with large
probability if an adversary may "corrupt" up to f nodes, but may make its choices as time
progresses, fully aware of what has happened so far? Since we operate in a system where all
operations take positive time, a node might even fail just when it is about to perform a certain
state transition, and would not have failed had the execution proceeded differently. Due to the
way we use randomization, however, this makes little difference for the stabilization properties of
the algorithm.

Corollary 5.6. Suppose that at every time t, an adversary has full knowledge of the state of the
system up to and including time t, and that it may decide on up to f nodes in total (or all channels
originating from a node) becoming faulty at arbitrary times. If it picks a node at time t, it fully
controls the node's actions from time t on. Furthermore, it controls delays and clock drifts of
non-faulty components within the system specifications, and it initializes the system in an
arbitrary state at time 0. For any k ∈ N, define



t_k := 2(k + 2)(ϑ(R2 + 3d) + 8(1 − λ)R2 + d) + R1/ϑ + T2 + T^ + 5d.

Then the set of all non-faulty nodes has reached a quasi-stabilization point by time t_k, from
where on they are weakly coherent, with probability at least

1 − e^{−k(n−f)/2}.



Proof. We need to show that Theorem 4.9 holds for the modified time interval [t, t + (k + 2)E3]
with the modified probability of at least 1 − e^{−k(n−f)/2}. If this is the case, we can proceed as
in Corollaries 5.2 and 5.3.

We start to track the execution from time 0. Whenever a node switches to state init at a good
time, the adversary must corrupt it in order to prevent subsequent deterministic stabilization. In
the proof of Theorem 4.9 we showed that for any non-faulty node, there are at least k + 1 different
times until 2ϑ(k + 2)E3 when it switches to init, each of which is good with a probability that is
independently lower bounded by 1/2. Since Lemma 4.8 holds for any execution where we have at most
f faults, the adversary corrupting some node at time t affects the current and future trials of
that node only, while the statement still holds true for the non-corrupted nodes. Thus, the
probability that the adversary may prevent the system from stabilizing until time t_k is upper
bounded by the probability that (k + 1)(n − f) independent and unbiased coin flips show tails f or
fewer times. Chernoff's bound states for the random variable X counting the number of tails in this
random experiment that for any δ ∈ (0, 1),

p[x < (1 - 5)nx\] < ((T^^J < 

Inserting δ = k/(k + 1) and E[X] = (k + 1)(n − f)/2, we see that

P[X ≤ f] ≤ P[X < (n − f)/2] < e^{−k(n−f)/2},
as claimed. □
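
For small parameters, the random experiment used in the proof can be evaluated exactly rather than bounded. The following is a minimal sketch, assuming only the experiment as stated above ((k + 1)(n − f) fair coin flips, with the adversary succeeding if at most f of them show tails); the function names are hypothetical and not part of the paper.

```python
from fractions import Fraction
from math import comb


def tail_probability(flips: int, max_tails: int) -> Fraction:
    """Exact probability that `flips` fair coin flips show at most `max_tails` tails."""
    favourable = sum(comb(flips, j) for j in range(max_tails + 1))
    return Fraction(favourable, 2 ** flips)


def adversary_success_probability(n: int, f: int, k: int) -> Fraction:
    """Probability of the event from the proof: (k + 1)(n - f) flips, at most f tails."""
    return tail_probability((k + 1) * (n - f), f)


if __name__ == "__main__":
    # Illustrative parameters only: n = 4 nodes, f = 1 fault.
    for k in range(1, 5):
        print(k, float(adversary_success_probability(4, 1, k)))
```

The exact value decreases quickly as k grows, which is the qualitative behaviour the corollary relies on.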



6 Implementation Issues 

In this section, we briefly survey some core aspects of the VLSI implementation of the pulse syn- 
chronization algorithm, which is currently being developed. Thereby we focus on the three major 
building blocks: (1) asynchronous state machines, (2) memory flags with thresholds and (3) watch- 
dog timers. 

The pulse synchronization algorithm at every node consists of several simple state machines that 
execute asynchronously and concurrently. There are several types of conditions that can trigger 
state transitions: 

(i) The state machines of a certain number (1, ≥ f + 1, or ≥ n − f) of remote nodes reached
some particular state, indicated by memory flags.

(ii) Some local state machine reached a particular state.

(iii) A watchdog timer expires.

These conditions may also be combined (using AND or OR).

We will employ standard Huffman-type asynchronous state machines [25] for implementing our
state machines, as they fit nicely to the Θ-Model already used in DARTS. Analyzing the transition
conditions of all the five state machines (Figures 2, 3 and 4) of a single node reveals that we need
to communicate six different states (recover, accept, join, propose, sleep → waking and "other") of
the core state machine (Figure 2) and two states each (supp, none and init, wait) for the two state
machines making up the resynchronization algorithm from every node to every node. There are
several possibilities for implementing this communication. For example, both a simple high-speed
serial protocol and a parallel five-bit bundled data bus with a strobe signal are viable
alternatives, each offering different trade-offs between implementation complexity, speed, area
consumption, etc.

We note, however, that any method for communicating states is complicated by the fact that 
state occupancy times may be very short in an asynchronous state machine: Reaching a state must 
always be faithfully conveyed to all remote nodes even if it is almost immediately left again. In 
addition, the core state machine may undergo various sequences of state transitions, implying that 
we cannot use a state encoding where only a single bit changes between successive states. Care 
must hence be taken in order not to trigger hazardous intermediate state occupancies at the receiver 
when communicating some multi-bit state change. Both problems can be handled using suitable 
bounded delay conditions. 



Remote Memory Flags and Thresholds 



Figure 5 shows the principle of implementing remote memory flags, which are the basic mechanism
required for type (i) state transition conditions at node i. For every remote node j, it consists
of a hazard-free demultiplexer that decodes the communicated state of node j's state machines, a
resettable memory flag per state that remembers whether node j has ever reached the respective
state since the most recent flag reset, and optionally a threshold module that combines the
corresponding flag outputs for all remote nodes. Note that every memory flag is implemented as a
(resettable) Muller C-Gate here, but could also be built by using a flip-flop.
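
The behaviour of the flags and the threshold module can be sketched in software. The following is a purely behavioural model under stated assumptions: it does not reproduce the wiring of Figure 5, the class and function names are hypothetical, and the C-Gate function only restates the input/output rule of a Muller C-Gate (hold the output when the inputs differ, otherwise adopt the common input).

```python
from dataclasses import dataclass


def muller_c(a: bool, b: bool, previous: bool) -> bool:
    """Muller C-Gate rule: output the common input if both agree, else hold `previous`."""
    return a if a == b else previous


@dataclass
class MemoryFlag:
    """Remembers whether a remote node was ever seen in a state since the last reset."""
    value: bool = False

    def observe(self, state_seen: bool) -> None:
        # An observation can only set the flag; it is cleared by reset() alone,
        # so even a very short state occupancy is remembered.
        if state_seen:
            self.value = True

    def reset(self) -> None:
        self.value = False


def threshold_reached(flags: list[MemoryFlag], threshold: int) -> bool:
    """Threshold module: true once at least `threshold` of the flags are set."""
    return sum(flag.value for flag in flags) >= threshold
```

With n = 4 and f = 1, setting the flags of two remote nodes satisfies the f + 1 threshold but not the n − f threshold, and a flag stays set after the remote node has left the state again.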

Implementing local state transition conditions (ii) is straightforward, as one simply needs to
incorporate (single) state signals from local state machines here. Note that every transition
condition comprises the node observing itself in a particular state, which also falls into this
category. To avoid metastable upsets in the asynchronous state machine (see below), it may be
necessary to add memory flags for local signals as well.



Watchdog Timers 

Our implementation of the watchdog timers, which are required for realizing state transition
conditions (iii), will rest upon a single local clock generator per node that drives all watchdog
timers. We will use a simple ring oscillator, i.e., a single inverter with feedback and a
prescaler, instead of a crystal oscillator, because it can be integrated on-chip. However, the
oscillator frequency of such ring oscillators varies heavily with operating conditions, in
particular with supply voltage and temperature, as well as with process conditions. The resulting
(two-sided) clock drift ζ (with respect to supply voltage, temperature and process variation) is
typically in the range of 7% to 9% for uncompensated ring oscillators and can be lowered down to 1%
to 2% by proper compensation techniques [32]. The two-sided clock drifts map to ϑ = (1 + ζ)/(1 − ζ)
bounds of 1.15 to 1.19 and 1.02 to 1.04, respectively. Recalling from Lemma 3.4 that ϑmax ≈ 1.247,
one sees that both uncompensated and compensated ring oscillators are suitable for implementation
of the pulse synchronization protocol's watchdog timers. However, care must be taken when the
protocol is used to stabilize DARTS: to compensate a typical drift of 15% of DARTS clocks, one must
ensure that ϑ is smaller than roughly 1.064 (cf. Section 7). Thus, here, only compensated ring
oscillators are sufficiently accurate. Note, however, that these are conservative bounds, assuming
that the synchronization protocol and DARTS drift into different directions. Considering that a
large share of the drift in both systems is due to variations in temperature, it seems reasonable
to assume that, in the long term, both drift into the same direction.

The Θ-Model assumes that we can enforce a certain ratio between slowest and fastest end-to-end
delay along critical signaling paths.

A Muller C-Gate retains its current output value when its inputs are different, and sets its
output to the common input otherwise.

Figure 5: Implementation principle of remote memory flags and thresholds.
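
The drift figures quoted for ring oscillators can be checked with a few lines of arithmetic. This sketch assumes only the mapping ϑ = (1 + ζ)/(1 − ζ) and the two thresholds quoted in the text (about 1.247 for the protocol alone, about 1.064 when stabilizing DARTS); the function and constant names are hypothetical.

```python
def drift_bound(zeta: float) -> float:
    """Map a two-sided relative clock drift zeta to theta = (1 + zeta) / (1 - zeta)."""
    if not 0 <= zeta < 1:
        raise ValueError("zeta must lie in [0, 1)")
    return (1 + zeta) / (1 - zeta)


THETA_MAX = 1.247    # bound quoted from Lemma 3.4 for the pulse protocol alone
THETA_DARTS = 1.064  # rough bound quoted for stabilizing DARTS clocks

if __name__ == "__main__":
    for zeta in (0.07, 0.09, 0.01, 0.02):
        theta = drift_bound(zeta)
        print(f"zeta={zeta:.2f}  theta={theta:.4f}  "
              f"protocol ok: {theta < THETA_MAX}  DARTS ok: {theta < THETA_DARTS}")
```

Under these assumptions, drifts of 7% to 9% stay below the first threshold but exceed the second, while drifts of 1% to 2% satisfy both.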



As shown in Figure 6, every watchdog timer consists of a resettable up-counter and a timeout
register, which holds the timeout value. A comparator compares the counter value and the timeout
register after every clock tick, and raises a stable expiration output signal if the counter value
is greater than or equal to the register value. The asynchronous reset of the counter, which also
resets the timeout output signal, is used to re-trigger the watchdog.

Figure 6: Implementation principle of watchdog timers.

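
The counter, register and comparator arrangement can be mimicked by a small behavioural model. This is a sketch only: it counts abstract clock ticks, ignores all timing and metastability effects of the real circuit, and the class name and method names are hypothetical.

```python
class WatchdogTimer:
    """Behavioural model: up-counter, timeout register, and comparator."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_register = timeout_ticks  # holds the timeout value
        self.counter = 0                       # resettable up-counter
        self.expired = False                   # stable expiration output

    def reset(self) -> None:
        """Re-trigger the watchdog: clears the counter and the expiration output."""
        self.counter = 0
        self.expired = False

    def tick(self) -> bool:
        """Advance by one clock tick and return the expiration output."""
        self.counter += 1
        # Comparator: raise the output once counter >= register; it then stays raised.
        if self.counter >= self.timeout_register:
            self.expired = True
        return self.expired
```

A timer created with a timeout of three ticks reports expiration on the third tick, keeps reporting it afterwards, and starts over after `reset()`.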
As for the watchdog timer with random timeout R3 in the resynchronization algorithm, the
simplest implementation would load a uniformly distributed random value into the timeout register
whenever the watchdog is re-triggered. Depending on the implementation technology, such random
values can be generated either via true random sources (thermal noise) or pseudo-random sources
(LFSRs) clocked by another ring oscillator. If we could guarantee that the content of the timeout
register and the random source cannot, by any means, be read or probed by anybody, such an
implementation satisfies the model requirements. Alternatively, one could use random sampling per
clock tick, which avoids storing the future timeout value and also converges to uniformly
distributed timeouts for sufficiently large values of R3.
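
To illustrate the pseudo-random option, the following sketch steps a 16-bit Galois LFSR and maps its state to a timeout value. The feedback mask 0xB400 is a commonly used maximal-length choice and is an assumption made here for illustration, not a parameter taken from the paper; the function names and the range mapping are likewise hypothetical, and the modulo mapping is only approximately uniform.

```python
LFSR_MASK = 0xB400  # assumed feedback mask for a maximal-length 16-bit Galois LFSR


def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Galois LFSR by one step; `state` must be non-zero."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= LFSR_MASK
    return state


def next_random_timeout(state: int, low: int, high: int) -> tuple[int, int]:
    """Return the new LFSR state and a timeout value in [low, high]."""
    state = lfsr16_step(state)
    return state, low + state % (high - low + 1)
```

A caller would keep the returned state and load the returned timeout into the timeout register each time the watchdog is re-triggered. As the text stresses, such a source is only adequate if neither the register nor the generator state can be observed.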

Combined State Transition Conditions 

Combining different state transition conditions (i)-(iii) via AND/OR requires some care, since an
asynchronous state machine requires stable input signals in order not to become metastable during
its state transition. Combining several conditions (i) does not cause any problems, since the
memory flags ensure that all outputs are stable. Non-stable signals, like "T1 AND < n − f accept",
require sampling via a flip-flop clocked by a stable signal. For example, the status of
"< n − f" = "¬(≥ n − f)" is sampled when the signal reporting expiration of T1 is issued.
Similarly, it might happen that conditions requiring conflicting state transitions are satisfied at
the same time, e.g., (T2, accept) might expire simultaneously with the threshold of "≥ f + 1
recover or accept" being reached. Obviously, both of the above situations could create a metastable
upset, either of the sampling flip-flop, or directly of the register(s) holding the node's state.
Fortunately, Theorem 4.4 revealed that this can happen during stabilization only. In regular
operation, e.g., the critical threshold of ≥ n − f accept is always reached before T1 expires.
Thus, the former is acceptable, as metastable upsets occur rarely and increase convergence time
only. Moreover, to further decrease the probability of a metastable upset that might affect
stabilization time, it is perfectly feasible to insert a synchronizer or an elastic pipeline after
the sampling flip-flop for capturing metastability [13]. This additional precaution merely
increases the latency by a constant delay, which, being restricted to the pulse synchronization
component, will not adversely affect the final precision and accuracy of the stabilized DARTS
clocks.



7 Coupling of DARTS and Pulse Synchronization Algorithm 

In this section, we describe how the self-stabilizing pulse synchronization protocol could be coupled 
with DARTS clocks. As this requires certain implementation details, we also sketch some ideas 
that might be used in a prototype implementation. The joint system provides a high-precision 
self-stabilizing Byzantine fault-tolerant clocking system for multi-synchronous GALS. 

The coupling between the pulse synchronization protocol and DARTS clocks involves two
directions:

1. The pulse synchronization protocol primarily monitors the operation of the DARTS clocks.
As long as DARTS ticks are generated correctly, it must not interfere with the DARTS tick
generation rules at all.

2. If DARTS clocks become inconsistent w.r.t. the behavior of the pulse synchronization protocol,
the latter must interfere with the regular DARTS tick generation, possibly up to resetting
DARTS clocks.

Note that in practice, the assumption that the timeout register and the random source cannot be
read or probed is reasonable, as even the node itself does not access this value except for
checking whether the timer expired, and the computational power of the system is very limited.



To assist the reader, we first provide a very brief overview of the original DARTS and its imple- 
mentation. 



7.1 DARTS Overview 



DARTS clocks (called TG-Algs in the sequel) are instances of a simple synchronizer [35] for the
Θ-Model based on consistent broadcasting [31]. They generate ticks Tick(0), Tick(1), Tick(2),
... approximately simultaneously at all correct nodes. Since actual DARTS ticks are just binary
clock signal transitions, which cannot carry tick numbers, the original algorithm had to be modified
significantly in order to be implementable in asynchronous digital logic. Figure 7 shows a schematic
of a single TG-Alg for a 5-node system.

Figure 7: Schematic of DARTS TG-Alg Implementation



Key components of a TG-Alg are counter modules, one per remote TG-Alg, which just count the
difference between the number of ticks generated locally and remotely. They are implemented using
a pair of elastic pipelines [33], which implement FIFO buffers for signal transitions. Matching ticks
in both pipelines, which are obviously irrelevant for the difference, are removed by the connecting
Diff-Gate. The status (> 0, ≥ 0) of all counter modules is fed into two threshold modules, whose
outputs trigger the generation of the next local tick. A detailed discussion of the implementation
can be found in [10].

The correctness proof and performance analysis in [16, 17, 15] revealed that correct TG-Algs
indeed generate synchronized clock ticks, in the presence of up to f Byzantine faulty TG-Algs
in a system with n ≥ 3f + 2 nodes: For any two correct nodes p, q, the number of clock ticks
generated by p and q by time t does not differ by more than a (very small) constant π, and the
frequency of any correct clock (and thus the maximum drift ρ) is within a certain range. In
addition, expressions (in the order of π) for the maximum size of the elastic pipelines in the counter
modules were established, which guarantee overflow-free operation. Experiments with both FPGA
and ASIC prototype implementations demonstrated that DARTS clocks indeed offer close to perfect
synchronization and very reliable operation.

Nevertheless, as already mentioned, (almost) simultaneous start-up of all TG-Algs and at most
f failures during the whole life-time of the system are mandatory preconditions for these results to
hold. DARTS supports neither late joining or recovery of TG-Algs, nor recovery from more than f
failures.

7.2 Required Extensions for Coupling DARTS and Pulse Synchronization 

The major obstacle for supporting late joining of TG-Algs, removing spuriously generated ticks
in the pipelines, etc. is the anonymity of the clock ticks used in DARTS: Since they are just signal
transitions on a single wire, they cannot encode any information except their occurrence time. The
most important extension of DARTS is hence to add a bundled data wire to the clock signal, which
carries 1 bit of data. This way, single ticks can be marked with a 1, distinguishing them from
ordinary non-marked ticks that carry a 0.

We will actually mark every T-th DARTS tick, for some suitably chosen T. Such a marked tick
kT, k ≥ 0, is to be understood as the start of the (k + 1)-st DARTS round, which consists of the
marked DARTS tick kT and T − 1 subsequent unmarked ticks kT + 1, kT + 2, ..., (k + 1)T − 1; the
marked tick (k + 1)T starts the next DARTS round. Note that the resulting DARTS ticks can be
interpreted as a discrete, bounded clock operating modulo T. As DARTS rounds at any two correct
TG-Algs are synchronous, marked ticks must always match in the pipelines of every counter, i.e.,
the Diff-Gate must always remove pairs of matching marked (or non-marked) ticks and can hence
detect and remove any inconsistency.
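
The round structure induced by marking every T-th tick can be written down directly. This sketch uses explicit tick numbers purely for illustration (the hardware ticks themselves carry only the one-bit mark); the function names are hypothetical.

```python
def mark_bit(tick: int, T: int) -> int:
    """Value on the bundled data wire: 1 for a marked tick kT, 0 otherwise."""
    return 1 if tick % T == 0 else 0


def darts_round(tick: int, T: int) -> int:
    """1-based round number: the marked tick kT starts the (k + 1)-st round."""
    return tick // T + 1


def ticks_of_round(round_number: int, T: int) -> range:
    """All ticks of a round: the marked tick kT followed by T - 1 unmarked ticks."""
    k = round_number - 1
    return range(k * T, (k + 1) * T)
```

For T = 4, ticks 0, 4 and 8 are marked, tick 7 is the last tick of the second round, and tick 8 starts the third.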

The actual coupling between the instance of the pulse synchronization protocol and the DARTS
clock running at node i is accomplished by means of two signals, namely, DARTS_i and PULSE_i:

• DARTS_i reports DARTS rounds to the pulse synchronization protocol. The rising edge of the
DARTS_i signal, which may trigger a switch from ready to propose, is issued when the DARTS
clock of node i generates tick kT − X, for some fixed X < T. The falling edge of DARTS_i
reports the occurrence of the marked tick kT.

• PULSE_i reports the generation of a pulse to the DARTS clock. Its rising edge is issued on the
transition to accept, and its falling edge signals the expiration of a fixed timeout Ty that is
reset at the time the rising edge is transmitted.

The basic idea underlying the coupling of the pulse synchronization protocol and DARTS is to
align marked ticks and pulses as follows: If the system operates normally, every correct node i first
reaches some DARTS tick kT − X and issues DARTS_i = 1. Next, a pulse is generated at node i
by the pulse synchronization protocol, which thus sets PULSE_i = 1. Subsequently, the DARTS
marked tick kT occurs, which is signaled by DARTS_i = 0. Finally, the pulse timeout Ty expires and
hence PULSE_i = 0 occurs. Normal operation thus expects that the DARTS marked tick (= the falling
edge of DARTS_i) occurs within the time window where PULSE_i is 1. Provided that the timeout used for
generating this window is chosen sufficiently large, namely, ϑρ(π + 2d + 1), this interleaving can
indeed be guaranteed in normal operation.

As long as this is the case, we just let DARTS generate its ticks using its standard rules. Should
a DARTS clock fail, however, such that PULSE_i and DARTS_i are not properly interleaved, then we
will force marking the next DARTS tick (and possibly resetting the TG-Alg, if needed) upon the
falling edge of PULSE_i. DARTS ticks and pulses (as well as marked DARTS ticks at different nodes)
will hence only be re-aligned in case of errors or desynchronization: As long as DARTS clocks work
correctly, any two correct TG-Algs will mark tick kT within the DARTS synchronization precision.

Provided that X and Ty are suitably chosen, it is not difficult to prove that the joint system
consisting of pulse synchronization protocol and DARTS clocks will stabilize: After some unstable
period, the pulse synchronization algorithm will stabilize, which we have proved to happen
independently of the (non-)operation of DARTS clocks. When the pulse synchronization protocol
eventually starts to generate synchronous pulses, the DARTS clocks will start to recover in a guided
(synchronized) manner. When all correct DARTS clocks are eventually synchronized to within the
intrinsic DARTS precision, the system will perpetually ensure the above interleaving at all correct
nodes.

Some additional observations: 

(1) Since the DARTS precision is typically considerably smaller than the worst case pulse
synchronization precision, the underlying DARTS clocks may be viewed as a "precision amplifier"
(as well as clock multiplier, see Section 7.2).

(2) There is no need to specify properties possibly achieved by DARTS clocks during their own
recovery. We only require that they eventually reach full synchronization in the presence of
synchronous pulses at all correct nodes. In practice, DARTS clocks will typically also gradually
improve their synchronization precision during this interval.

(3) Although the pulse synchronization algorithm stabilizes even when the DARTS clocks behave
arbitrarily, it is nevertheless the case that it achieves better pulse synchronization precision
when the DARTS clocks are fully synchronized.

(4) One might ask why we did not just use the k-th rising edge of PULSE_i to mark the very next
DARTS tick generated by the TG-Alg at node i. This simple solution has several major
drawbacks. First, the pulse synchronization precision is typically worse than the synchronization
precision provided by DARTS. Thus, every pulse would result in a temporary deterioration
of the DARTS synchronization quality. Second, marked ticks are not necessarily generated
exactly every T DARTS ticks. And last but not least, since DARTS clocks and pulse
synchronization execute completely asynchronously, marking DARTS ticks at pulse occurrence times
would create the potential of metastability every kT DARTS ticks, even if there is no failure
at all.



7.3 DARTS ⇒ pulse synchronization

To implement this part of the coupling, every DARTS clock signals the upcoming occurrence of
marked tick kT to its local instance of the pulse synchronization protocol. This is accomplished
by the rising edge of DARTS_i, the dedicated DARTS signal, which is generated upon DARTS tick
kT − X. If all correct nodes happen to do this within some time window when they are (w.r.t.
the pulse algorithm) in state ready with T3 < T4 already expired, all correct nodes will switch to
state propose within π time. Subsequently, they will all switch to state accept within d time. To
make sure that indeed all correct nodes are in state ready with T3 already expired, up to small
additional terms of O(d), we must choose the minimal duration of a DARTS round to be larger
than T2 + T3 + 4d, while (T2 + T4)/ϑ is to exceed its maximal duration. Assuming that ρ < 1.15,
Lemma 3.4 shows that this is feasible up to ϑ ≈ 1.064, which is clearly within the reach of ring
oscillators [32].

We remark that it is vital not to rely on the DARTS clock for generating the timeout that
defines the window during which PULSE_i is 1.
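
The two constraints on the duration of a DARTS round can be phrased as a simple predicate. This is a minimal sketch that ignores the additional O(d) terms mentioned in the text; the function name, the parameter names and all numeric values in the example are hypothetical.

```python
def round_duration_feasible(min_round: float, max_round: float,
                            T2: float, T3: float, T4: float,
                            d: float, theta: float) -> bool:
    """Check the two stated constraints on the duration of a DARTS round.

    min_round must exceed T2 + T3 + 4d, and (T2 + T4) / theta must exceed max_round.
    """
    long_enough = min_round > T2 + T3 + 4 * d
    short_enough = (T2 + T4) / theta > max_round
    return long_enough and short_enough


if __name__ == "__main__":
    # Purely illustrative numbers; a 15% DARTS drift is modelled as max = 1.15 * min.
    min_round = 310.0
    print(round_duration_feasible(min_round, 1.15 * min_round,
                                  T2=100.0, T3=200.0, T4=400.0, d=1.0, theta=1.05))
```

The predicate makes the trade-off visible: a larger ϑ lowers (T2 + T4)/ϑ and thus shrinks the admissible range of round durations.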
7.4 Pulse synchronization ⇒ DARTS

This part of the coupling between DARTS and the pulse synchronization protocol requires two
mechanisms:

(1) A way to force a marked DARTS tick at node i upon the occurrence of the falling edge of
PULSE_i, provided that no marked tick (i.e., the falling edge of DARTS_i) has been generated
while PULSE_i was 1. This may also include recovering from a complete stall of the DARTS tick
generation.

(2) A way to recover accurate DARTS synchronization after forcing marked ticks, which may also
include the need to get rid of any information from the preceding unstable period.

To achieve (1), we use a simple asynchronous circuit that supervises the interleaving of PULSE_i
and DARTS_i, and generates a "force marking" signal if the falling edge of DARTS_i does not occur
in time. Note that this device can be built in a way that entirely avoids metastability in case of
normal operation. In an unstable period, however, it may happen that force marking occurs exactly
at the time when DARTS generates its marked tick, so a metastable upset or two very close marked
ticks (a forced and a regularly generated one) are possible.
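
The decision taken by the supervising circuit can be expressed over event timestamps. This is a behavioural sketch only: it works on idealized event times, treats the window boundaries as inclusive (an arbitrary choice made here), and says nothing about how the asynchronous circuit avoids metastability. The function name and its parameters are hypothetical.

```python
from typing import Iterable


def force_marking_needed(pulse_rise: float, pulse_fall: float,
                         darts_fall_times: Iterable[float]) -> bool:
    """Return True if no falling edge of DARTS_i occurred while PULSE_i was 1.

    pulse_rise / pulse_fall delimit the window in which PULSE_i is 1;
    darts_fall_times are the times of marked ticks (falling edges of DARTS_i).
    """
    return not any(pulse_rise <= t <= pulse_fall for t in darts_fall_times)
```

An empty list of falling edges models a complete stall of the tick generation, in which case marking is forced at the falling edge of PULSE_i.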

There are several variants for implementing the forced marking itself, including the simplest
variant of just resetting the TG-Alg in order to generate marked tick 0. The need for possibly
resetting a TG-Alg originates from the fact that stateful TG-Alg components may deadlock due
to earlier failures. For example, a deadlocked pipe will never propagate ticks from its input to
the Diff-Gate. Unfortunately, resetting TG-Algs complicates DARTS recovery considerably: If a
TG-Alg reset would also reset the remote pipes of its counters, it might lose "fresh" marked ticks
generated by remote TG-Algs. Hence, remote pipes should only be reset when the remote node is
reset. However, since a remote node might never observe a discrepancy between DARTS rounds and
pulses, this approach might end up in the pipe not being reset at all. This is problematic as it might
effectively render the node faulty despite all its components being operational. Luckily, we may
utilize the fact that solving (2) under the assumption that correct pipes are not deadlocked yields a
trivial means to distinguish a locked pipe from an operational one: If the Diff-Gate cannot remove
any ticks within a certain time interval after a (correct) pulse, the pipe must have deadlocked and
can safely be reset. At the next pulse, all pipes will have recovered from previous deadlocks and a
solution to (2) assuming deadlock-free pipes will succeed.

In contrast to the model we employed for our analysis, we neglect the local signaling delay when
stating that all correct nodes switch to state propose within π time, as it is smaller than the
time to generate a single tick.

The feasibility up to ϑ ≈ 1.064 holds regardless of the additional term of O(d), as the bound is
derived from an asymptotic statement.
To explain how we achieve (2), we start with the observation that our way of marking every
T-th tick implies that, for any two correct DARTS TG-Algs, it will always be a marked tick kT from
a remote node that is matched by the local marked tick kT in every counter of Figure 7 when the
Diff-Gate removes it. That is, if ever a marked tick is matching a non-marked tick in a counter,
ticks have been lost or spuriously generated somewhere, or local and remote node are severely out
of synchrony.

Assume for the moment that we could generate exactly one marked tick at every correct node,
that we made sure that no such tick is in the system before this happens, and that we have elastic
pipelines of infinite size. The following simple strategy would eventually establish matching DARTS
ticks: Whenever a Diff-Gate encounters a marked tick in one pipe matched by an ordinary tick
in the other, it removes the ordinary tick only. At the pipe level, this rule implies that whatever
the state of the pipes was before the marked ticks were generated, they will be cleared before the
matching pair of marked ticks is removed. Since all DARTS tick generation rules ensure that no
TG-Alg generates any tick kT + 1, kT + 2, ... based on information from the previous DARTS round
k − 1 (consisting of DARTS ticks up to kT − 1), all counter states will be valid as soon as the
matching marked ticks kT have been removed. As DARTS essentially generates ticks based on comparing
the number of locally and remotely generated ticks, this is enough to ensure stabilization of the
DARTS system; full DARTS precision will be achieved quickly because nodes "catch up", i.e., generate
tick numbers that at least f + 1 correct DARTS clocks already reached, faster than "new ticks",
i.e., ones that no correct node generated yet, may occur.
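
The removal rule described above can be simulated on two FIFO queues. This is a sketch under the idealized assumptions of the paragraph (one marked tick per node, unbounded pipes); the constant and function names are hypothetical, and the model ignores all timing aspects of the real Diff-Gate.

```python
from collections import deque

MARKED, PLAIN = 1, 0


def diff_gate_step(local: deque, remote: deque) -> bool:
    """Apply one Diff-Gate action to the pipe heads; return False if nothing can be removed."""
    if not local or not remote:
        return False
    if local[0] == remote[0]:
        # Matching pair (both marked or both ordinary): remove both ticks.
        local.popleft()
        remote.popleft()
    elif local[0] == MARKED:
        # Marked tick matched by an ordinary one: remove the ordinary tick only.
        remote.popleft()
    else:
        local.popleft()
    return True


def run_diff_gate(local: deque, remote: deque) -> None:
    """Let the Diff-Gate act until one of the pipes is empty."""
    while diff_gate_step(local, remote):
        pass
```

Starting from pipes that hold stale ordinary ticks in front of the marked ones, the stale ticks are discarded first, and only then is the pair of marked ticks removed, so the remaining contents reflect the difference counted from the marked tick onwards.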

The issue of finite-size pipes is (largely) solved by the pulse synchronization protocol: Pulses and
hence marked ticks are generated close to each other, in a time window of at most 2d + Ty ∈ O(d)
(provided that Ty is not unnecessarily large). Hence, apart from implementation issues, pipes that
can accommodate all ticks that may be generated within this time window are sufficient for not
losing any valid DARTS tick.

In reality, however, we cannot always expect the "single marked tick" setting described above:
Elastic pipelines may initially be populated with arbitrarily many marked ticks from the unstable
period. We must hence make sure that all these marked ticks (and the unmarked ticks in between)
are eventually removed, and that we do not generate new marked ticks close to each other. The
pulse synchronization protocol will ensure that forced ticks are separated by T DARTS ticks, and
our implementations of (1) and (2) will ensure with a large probability that a forced marked tick
will not be generated close to a marked tick generated regularly by DARTS. Under these conditions,
it is a relatively easy task to clear all superfluous marked ticks between pulses.

For example, we could asynchronously reset the whole data flip-flop chain that holds the
markings of the ticks (not the ticks themselves!) currently in a pipe shortly after the rising edge
of PULSE_i. Enlarging X and Ty slightly, we can be sure that all TG-Algs will remove spurious
markings from their pipes before any marked tick associated with the respective correct pulse is
generated. Although this could generate metastability in the Diff-Gate, namely, when the tick at
the head of the pipe is a marked tick and the Diff-Gate is about to act when the pipe is reset upon
arrival of a new marked tick, this cannot happen during normal operation.